kgoldfeld / simstudy

simstudy: Illuminating research methods through data generation
https://kgoldfeld.github.io/simstudy/
GNU General Public License v3.0
81 stars 7 forks source link

External variable in `logisticCoefs` call not recognized when inside function call #220

Closed moosterwegel closed 7 months ago

moosterwegel commented 7 months ago

First of all, thank you for providing an R package that makes it trivial to simulate some data with a certain marginal probability. Much appreciated.

I'm running into an issue with logisticCoefs when simulating p(x) p(m|x) p(y|x,m). I want to define the strength of the effect of x on m on the fly, but I can only get it to work outside of a function call. The variable that sets the strength of the effect is not recognized inside the function call. See the reprex below.

Any pointers to what's going wrong?

Thank you.

library(simstudy)
set.seed(2024)

# p(x) p(m) p(y|x,m)
f1 <- function() {
  n <- 2000
  prevalence <- 0.1
  coefs_y <- c(log(3), log(2))

  defs <- defData(
    defData(varname = "x", formula = 0, variance = 1, dist = "normal"), 
    varname = "m",
    formula = 0,
    dist = "normal", variance = 1)

  intercept <- logisticCoefs(defCovar = defs, coefs = coefs_y, popPrev = prevalence)[['B0']]

  defs <- defData(
    defs,
    varname = "y",
    formula = "t(c(..intercept, ..coefs_y)) %*% c(1, x, m)",
    dist = "binary", link = "logit")

  df <- genData(n, defs)
  glm(y ~ 1 + x + m, family = binomial(link = "logit"), data = df) |> coef() |> exp()
}

f1()
#> (Intercept)           x           m 
#>  0.05879635  2.87530784  1.83699935

# p(x) p(m|x) p(y|x,m)
f2 <- function() {
  n <- 2000
  prevalence <- 0.1
  coefs_y <- c(log(3), log(2))
  beta_x_m <- 2 

  defs <- defData(
    defData(varname = "x", formula = 0, variance = 1, dist = "normal"), 
    varname = "m",
    formula = "x * ..beta_x_m", # only difference with f1
    dist = "normal", variance = 1)

  intercept <- logisticCoefs(defCovar = defs, coefs = coefs_y, popPrev = prevalence)[['B0']]

  defs <- defData(
    defs,
    varname = "y",
    formula = "t(c(..intercept, ..coefs_y)) %*% c(1, x, m)",
    dist = "binary", link = "logit")

  df <- genData(n, defs)
  glm(y ~ 1 + x + m, family = binomial(link = "logit"), data = df) |> coef() |> exp()
}

f2()
#> Error: External variable(s) not defined or NA: beta_x_m

# the code from f2 works fine outside of a function
n <- 2000
prevalence <- 0.1
coefs_y <- c(log(3), log(2))
beta_x_m <- 2

defs <- defData(
  defData(varname = "x", formula = 0, variance = 1, dist = "normal"), 
  varname = "m",
  formula = "x * ..beta_x_m", 
  dist = "normal", variance = 1)

intercept <- logisticCoefs(defCovar = defs, coefs = coefs_y, popPrev = prevalence)[['B0']]

defs <- defData(
  defs,
  varname = "y",
  formula = "t(c(..intercept, ..coefs_y)) %*% c(1, x, m)",
  dist = "binary", link = "logit")

df <- genData(n, defs)
glm(y ~ 1 + x + m, family = binomial(link = "logit"), data = df) |> coef() |> exp()
#> (Intercept)           x           m 
#>  0.01726143  3.29607931  2.04816878
sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] simstudy_0.7.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] backports_1.4.1     digest_0.6.31       fastmap_1.1.1      
#>  [4] xfun_0.42           glue_1.6.2          fastglm_0.0.3      
#>  [7] knitr_1.43          bigmemory_4.6.4     htmltools_0.5.5    
#> [10] rmarkdown_2.22      lifecycle_1.0.3     cli_3.6.1          
#> [13] reprex_2.1.0        data.table_1.14.8   withr_2.5.0        
#> [16] compiler_4.3.1      bigmemory.sri_0.1.8 rstudioapi_0.14    
#> [19] tools_4.3.1         evaluate_0.21       Rcpp_1.0.10        
#> [22] yaml_2.3.7          rlang_1.1.1         fs_1.6.2           
#> [25] uuid_1.1-0
kgoldfeld commented 7 months ago

Thanks for reaching out. Yes - you've found a bug that we need to fix related to the double-dot notation. Currently, that variable only gets evaulated if it exists in the "root" environment. So, when you create beta_x_m inside the function, the double dot functionality does not find it - it is looking for it in the calling environment. So, a workaround until we fix this is to define beta_x_m just before calling f2(), like this:

library(simstudy)
set.seed(2024)

# p(x) p(m|x) p(y|x,m)
f2 <- function() {
  n <- 2000
  prevalence <- 0.1
  coefs_y <- c(log(3), log(2))

  defs <- defData(
    defData(varname = "x", formula = 0, variance = 1, dist = "normal"), 
    varname = "m",
    formula = "x * ..beta_x_m", # only difference with f1
    dist = "normal", variance = 1)

  intercept <- logisticCoefs(defCovar = defs, coefs = coefs_y, popPrev = prevalence)[['B0']]

  defs <- defData(
    defs,
    varname = "y",
    formula = "t(c(..intercept, ..coefs_y)) %*% c(1, x, m)",
    dist = "binary", link = "logit")

  df <- genData(n, defs)
  glm(y ~ 1 + x + m, family = binomial(link = "logit"), data = df) |> coef() |> exp()
}

beta_x_m <- 2 
f2()
#> (Intercept)           x           m 
#>  0.01568302  5.22871219  1.60440634

Created on 2024-03-23 with reprex v2.1.0

You can do this with replications. Here is an example with 3 replications where beta_x_m is set to 1.5:

beta_x_m <- 1.5 
lapply(1:3, function(x) f2())
#> [[1]]
#> (Intercept)           x           m 
#>  0.02491957  3.83257775  1.63143803 
#> 
#> [[2]]
#> (Intercept)           x           m 
#>  0.02581665  2.26940339  2.24345931 
#> 
#> [[3]]
#> (Intercept)           x           m 
#>   0.0171123   3.6076806   1.9789129

Created on 2024-03-23 with reprex v2.1.0

I'll let you know when we've fixed the bigger issue.

kgoldfeld commented 7 months ago

Just to follow up - this bug is related specifically to function logisticCoefs. In general one can pass arguments to functions and refer to them in double dot notation:

library(simstudy)

fa <- function(a) {

  def <- 
    defData(varname = "x", formula = 4, variance = 2)  |>
    defData(varname = "z", formula = "x*..a", variance = 3)

  genData(1000, def)[, mean(z)]
}

set.seed(123123)

fa(5)
#> [1] 19.96077
fa(20)
#> [1] 80.67964

Created on 2024-03-23 with reprex v2.1.0

kgoldfeld commented 7 months ago

@assignUser I see the problem - in the logisticCoefs, the algorithm requires that a dataset using the specified data definition is generated. Of course, the double dot variable has not been passed as an argument in the call to logisticCoefs, resulting in the error message. The question is, is there a way to pass those variables (in this case beta_x_m) without the user explicitly specifying it. I'm guessing not - so the solution would require the user to attach the reference variables as an argument.

kgoldfeld commented 7 months ago

@assignUser I think I see a possible solution: the get function allows us to pull variables from the parent environment. So, if I can extract all the double dot references from the data definition, I can probably pull those variables into the environment.

assignUser commented 7 months ago

Yeah that sounds like it would work, I can look over the issue in a bit too!

kgoldfeld commented 7 months ago

I've pretty much got it working - will incorporate into logisticCoefs soon.

assignUser commented 7 months ago

Oh great, feel free to ping me for a review. I had to reset my dev environment (thanks windows...)

kgoldfeld commented 7 months ago

This has been fixed - install from development. (No need for workaround provided above.)

moosterwegel commented 7 months ago

That's quick, superb. Thanks!