rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
358 stars 31 forks source link

`emmeans()` error on multiply imputed coxph models #392

Closed walrossker closed 1 year ago

walrossker commented 1 year ago

Description

After multiple imputation with the mice package, the emmeans function is not working on the mira object of survival::coxph models (i.e., it produces the error shown below: no applicable method for 'recover_data' applied to an object of class "coxph"). Both mira objects and coxph models are compatible with emmeans, but maybe not the combination of the two?

Reprex

# Create raw dataset
dat <- survival::lung
dat[seq(1, nrow(dat), by = 10), "age"] <- as.numeric(NA)
dat$sex <- factor(dat$sex)
dat <- dat[, c("status", "time", "age", "sex")]

# Impute missing data points
imp <- mice::mice(dat, method = "cart", printFlag = FALSE)

# Run cox regression on each imputed dataset
mods <- with(imp, survival::coxph(survival::Surv(time = time, event = status) ~ age + sex))

# Try to get marginal means
emmeans::emmeans(mods, "sex")
#> Error in UseMethod("recover_data") : 
#>   no applicable method for 'recover_data' applied to an object of class "coxph"
#> Error in (function (object, at, cov.reduce = mean, cov.keep = get_emm_option("cov.keep"), : Perhaps a 'data' or 'params' argument is needed

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.7.0      zoo_1.8-10        tidyselect_1.1.1  xfun_0.23        
#>  [5] purrr_0.3.4       splines_4.0.4     lattice_0.20-41   vctrs_0.5.1      
#>  [9] generics_0.1.0    htmltools_0.5.4   yaml_2.3.5        utf8_1.2.2       
#> [13] survival_3.4-0    rlang_1.0.6       R.oo_1.25.0       pillar_1.7.0     
#> [17] glue_1.6.2        withr_2.5.0       DBI_1.1.1         R.utils_2.12.0   
#> [21] multcomp_1.4-20   emmeans_1.7.3     R.cache_0.16.0    lifecycle_1.0.3  
#> [25] stringr_1.4.0     R.methodsS3_1.8.2 mvtnorm_1.1-3     codetools_0.2-18 
#> [29] coda_0.19-4       evaluate_0.14     knitr_1.33        fastmap_1.1.0    
#> [33] fansi_0.5.0       highr_0.9         TH.data_1.1-1     broom_0.7.6      
#> [37] Rcpp_1.0.8.3      xtable_1.8-4      backports_1.2.1   fs_1.5.2         
#> [41] digest_0.6.27     stringi_1.7.6     dplyr_1.0.6       grid_4.0.4       
#> [45] cli_3.4.1         tools_4.0.4       sandwich_3.0-2    magrittr_2.0.1   
#> [49] tibble_3.1.2      mice_3.15.0       crayon_1.5.1      tidyr_1.1.3      
#> [53] pkgconfig_2.0.3   MASS_7.3-53       ellipsis_0.3.2    Matrix_1.5-1     
#> [57] estimability_1.3  reprex_2.0.0      assertthat_0.2.1  rmarkdown_2.8    
#> [61] rstudioapi_0.13   R6_2.5.1          rpart_4.1-15      compiler_4.0.4
rvlenth commented 1 year ago

This is really strange. If I look at emmeans:::recover_data.mira in debug mode

debug at ... emmeans/R/multiple-models.R#91: rdlist = lapply(object$analyses, recover_data, ...)
Browse[2]> head(recover_data(object$analyses[[1]]))
  age sex
1  53   1
2  68   1
3  56   1
4  57   1
5  60   1
6  74   1

That is, it wants to do lapply(object$analyses, recover_data, ...), and that fails, yet if I intervene and call recover_data on the first element, it's fine.

Another aspect of this is that I have my own method dispatch that allows for a local override of the method. So, what if I create my own recover_data.coxph method?

recover_data.coxph = function(object, ...) 
    emmeans:::recover_data.survreg(object, ...)

Now I get

> emmeans(mods, "sex")
 sex emmean    SE  df asymp.LCL asymp.UCL
 1    1.136 0.578 Inf   0.00278      2.27
 2    0.625 0.612 Inf  -0.57452      1.82

Results are given on the log (not the response) scale. 
Confidence level used: 0.95

It works! So I guess this is a temporary workaround.

I guess this has something to do with my local dispatch stuff. But I am mystified. Maybe it's related to lazy evaluation somehow.

walrossker commented 1 year ago

Interesting. I've run into issues before at the nexus the survival and mice packages that I can't explain. For example, I often need to explicitly load the survival package with a library() call in order to pool and summarize results from imputed coxph models -- a simple mice::pool call does not work often. Not sure if that's relevant here though.

Thanks for looking into this -- the workaround works on my machine too.

walrossker commented 1 year ago

One follow-up issue I've encountered in my workflow comes from not running models through the with() function, but rather running them manually on each imputed dataset (i.e., iterating on a list of imputed datasets). I do this out of necessity because I cannot successfully run my real model using the with() function (it keeps telling me variables are missing when they aren't).

More specifically, when I run models by supplying a formula object rather than explicitly specifying the formula in code, emmeans says that it is unable to reconstruct the data. My workflow requires that I supply the formula as a formula object rather than hard coding it because the steps before create a formula string based on user-supplied options.

Here is an updated reprex in which the imputation object is converted to a list of datasets, a model is fit on each, and then they are converted into a mira object to prep for emmeans:

# Create raw dataset
dat <- survival::lung
dat[seq(1, nrow(dat), by = 10), "age"] <- as.numeric(NA)
dat$sex <- factor(dat$sex)
# dat$time <- as.numeric(dat$time)
dat <- dat[, c("status", "time", "age", "sex")]

# Impute missing data points
imp <- mice::mice(dat, method = "cart", printFlag = FALSE)

# Create formula for each model
.formula <- as.formula("survival::Surv(time = time, event = status) ~ age + survival::strata(sex)")

# Run cox regression on each imputed dataset in list format
mods <- lapply(miceadds::mids2datlist(imp), function(.x){
  survival::coxph(
    formula = .formula,
    data = .x)
})

recover_data.coxph = function(object, ...){ 
  emmeans:::recover_data.survreg(object, ...)
}

# Try to get marginal means
emmeans::emmeans(mice::as.mira(mods), "sex")
#> Error in (function (object, at, cov.reduce = mean, cov.keep = get_emm_option("cov.keep"), : We are unable to reconstruct the data.
#> The variables needed are:
#>  age sex
#> Are any of these actually constants? (specify via 'params = ')
#> The dataset name is:
#>  .x
#> Does the data still exist? Or you can specify a dataset via 'data = '

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.7.0      zoo_1.8-10        tidyselect_1.1.1  xfun_0.23        
#>  [5] purrr_0.3.4       mitools_2.4       splines_4.0.4     lattice_0.20-41  
#>  [9] vctrs_0.5.1       generics_0.1.0    htmltools_0.5.4   yaml_2.3.5       
#> [13] utf8_1.2.2        survival_3.4-0    rlang_1.0.6       R.oo_1.25.0      
#> [17] pillar_1.7.0      glue_1.6.2        withr_2.5.0       DBI_1.1.1        
#> [21] R.utils_2.12.0    multcomp_1.4-20   emmeans_1.7.3     R.cache_0.16.0   
#> [25] lifecycle_1.0.3   stringr_1.4.0     R.methodsS3_1.8.2 mvtnorm_1.1-3    
#> [29] codetools_0.2-18  coda_0.19-4       evaluate_0.14     knitr_1.33       
#> [33] fastmap_1.1.0     miceadds_3.12-26  fansi_0.5.0       highr_0.9        
#> [37] TH.data_1.1-1     broom_0.7.6       Rcpp_1.0.8.3      xtable_1.8-4     
#> [41] backports_1.2.1   fs_1.5.2          digest_0.6.27     stringi_1.7.6    
#> [45] dplyr_1.0.6       grid_4.0.4        cli_3.4.1         tools_4.0.4      
#> [49] sandwich_3.0-2    magrittr_2.0.1    tibble_3.1.2      mice_3.15.0      
#> [53] crayon_1.5.1      tidyr_1.1.3       pkgconfig_2.0.3   MASS_7.3-53      
#> [57] ellipsis_0.3.2    Matrix_1.5-1      estimability_1.3  reprex_2.0.0     
#> [61] assertthat_0.2.1  rmarkdown_2.8     rstudioapi_0.13   R6_2.5.1         
#> [65] rpart_4.1-15      compiler_4.0.4
rvlenth commented 1 year ago

I'm pretty sure the problem here is that formulas carry an environment with them, and your constructed formula has the wrong environment from the one where the data exist.

And I don't think this can work correctly anyway, because the same symbol .x is reused for each different dataset. We need to be able to reference each imputed dataset and recover each one correctly.

walrossker commented 1 year ago

FWIW, the manual or list-based approach does work when I specify the formula in the coxph function call and retain the .x data reference (see reprex below). Is there any way you can think to complete this without the with() function?

# Create raw dataset
dat <- survival::lung
dat[seq(1, nrow(dat), by = 10), "age"] <- as.numeric(NA)
dat$sex <- factor(dat$sex)
# dat$time <- as.numeric(dat$time)
dat <- dat[, c("status", "time", "age", "sex")]

# Impute missing data points
imp <- mice::mice(dat, method = "cart", printFlag = FALSE)

# Run cox regression on each imputed dataset in list format
mods <- lapply(miceadds::mids2datlist(imp), function(.x){
  survival::coxph(
    formula = survival::Surv(time = time, event = status) ~ age + survival::strata(sex),
    data = .x)
})

recover_data.coxph = function(object, ...){ 
  emmeans:::recover_data.survreg(object, ...)
}

# Try to get marginal means
emmeans::emmeans(mice::as.mira(mods), "sex")
#>  sex emmean    SE  df asymp.LCL asymp.UCL
#>  1    1.140 0.571 Inf    0.0216      2.26
#>  2    0.631 0.604 Inf   -0.5541      1.82
#> 
#> Results are given on the log (not the response) scale. 
#> Confidence level used: 0.95

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.7.0      zoo_1.8-10        tidyselect_1.1.1  xfun_0.23        
#>  [5] purrr_0.3.4       mitools_2.4       splines_4.0.4     lattice_0.20-41  
#>  [9] vctrs_0.5.1       generics_0.1.0    htmltools_0.5.4   yaml_2.3.5       
#> [13] utf8_1.2.2        survival_3.4-0    rlang_1.0.6       R.oo_1.25.0      
#> [17] pillar_1.7.0      glue_1.6.2        withr_2.5.0       DBI_1.1.1        
#> [21] R.utils_2.12.0    multcomp_1.4-20   emmeans_1.7.3     R.cache_0.16.0   
#> [25] lifecycle_1.0.3   stringr_1.4.0     R.methodsS3_1.8.2 mvtnorm_1.1-3    
#> [29] codetools_0.2-18  coda_0.19-4       evaluate_0.14     knitr_1.33       
#> [33] fastmap_1.1.0     miceadds_3.12-26  fansi_0.5.0       highr_0.9        
#> [37] TH.data_1.1-1     broom_0.7.6       Rcpp_1.0.8.3      xtable_1.8-4     
#> [41] backports_1.2.1   fs_1.5.2          digest_0.6.27     stringi_1.7.6    
#> [45] dplyr_1.0.6       grid_4.0.4        cli_3.4.1         tools_4.0.4      
#> [49] sandwich_3.0-2    magrittr_2.0.1    tibble_3.1.2      mice_3.15.0      
#> [53] crayon_1.5.1      tidyr_1.1.3       pkgconfig_2.0.3   MASS_7.3-53      
#> [57] ellipsis_0.3.2    Matrix_1.5-1      estimability_1.3  reprex_2.0.0     
#> [61] assertthat_0.2.1  rmarkdown_2.8     rstudioapi_0.13   R6_2.5.1         
#> [65] rpart_4.1-15      compiler_4.0.4
rvlenth commented 1 year ago

I am lost. The code you show just above that runs successfully does not involve the with() function -- so I'd say that you found a way to do it without use of with().

It could well be that emmeans is retrieving the same data with each model instance, even though the models are based on different datasets. And that might even be OK, in that the retrieved data serves as a reference for getting factor levels, etc.

Also, as I said before, environment(formula) is important for retrieving the data. I think it should be an environment where .x. can be found.

This issue has morphed a couple of times into something else, and I don't know where this is going. I have had almost zero experience with the mice package and I am not sure what issue we are addressing anymore.

walrossker commented 1 year ago

Is there any way you can think to complete this without the with() function?

This question was poorly worded. I meant is there any way to avoid the use of with() AND still use a stored formula like the reprex before it tried to do but failed:

# Create formula for each model
.formula <- as.formula("survival::Surv(time = time, event = status) ~ age + survival::strata(sex)")

# Run cox regression on each imputed dataset in list format
mods <- lapply(miceadds::mids2datlist(imp), function(.x){
  survival::coxph(
    formula = .formula,
    data = .x)
})

recover_data.coxph = function(object, ...){ 
  emmeans:::recover_data.survreg(object, ...)
}

# Try to get marginal means
emmeans::emmeans(mice::as.mira(mods), "sex")
#> Error in (function (object, at, cov.reduce = mean, cov.keep = get_emm_option("cov.keep"), : We are unable to reconstruct the data.
#> The variables needed are:
#>  age sex
#> Are any of these actually constants? (specify via 'params = ')
#> The dataset name is:
#>  .x
#> Does the data still exist? Or you can specify a dataset via 'data = '

Based on your comments, it seems like I might just need to learn more about how formulae specify and/or carry environments in order to make this approach work.

rvlenth commented 1 year ago

Meanwhile, I have modified the method dispatch code so now it works without specifying a local method. It'll be in the next push.

rvlenth commented 1 year ago

I think this may work:

mods <- lapply(miceadds::mids2datlist(imp), function(.x) {
    environment(.formula) <- environment()
  survival::coxph(
    formula = .formula,
    data = .x)
})
walrossker commented 1 year ago

That worked, thanks!

rvlenth commented 1 year ago

I'm glad. And it suggests that you do get the different datasets. I did a small test:

> tst = lapply(1:4, function(x) environment())
> tst
[[1]]
<environment: 0x00000144513d8fd0>

[[2]]
<environment: 0x00000144513d8c88>

[[3]]
<environment: 0x00000144514cbe08>

[[4]]
<environment: 0x00000144514cbba0>

> tst[[3]] $ x
[1] 3

This confirms that lapply() creates different environments, and the value of x is different in each.

rvlenth commented 1 year ago

I think this is completed, so am closing.