Closed walrossker closed 1 year ago
This is really strange. If I look at emmeans:::recover_data.mira
in debug mode
debug at ... emmeans/R/multiple-models.R#91: rdlist = lapply(object$analyses, recover_data, ...)
Browse[2]> head(recover_data(object$analyses[[1]]))
age sex
1 53 1
2 68 1
3 56 1
4 57 1
5 60 1
6 74 1
That is, it wants to do lapply(object$analyses, recover_data, ...)
, and that fails, yet if I intervene and call recover_data
on the first element, it's fine.
Another aspect of this is that I have my own method dispatch that allows for a local override of the method. So, what if I create my own recover_data.coxph
method?
recover_data.coxph = function(object, ...)
emmeans:::recover_data.survreg(object, ...)
Now I get
> emmeans(mods, "sex")
sex emmean SE df asymp.LCL asymp.UCL
1 1.136 0.578 Inf 0.00278 2.27
2 0.625 0.612 Inf -0.57452 1.82
Results are given on the log (not the response) scale.
Confidence level used: 0.95
It works! So I guess this is a temporary workaround.
I guess this has something to do with my local dispatch stuff. But I am mystified. Maybe it's related to lazy evaluation somehow.
Interesting. I've run into issues before at the nexus the survival
and mice
packages that I can't explain. For example, I often need to explicitly load the survival
package with a library()
call in order to pool and summarize results from imputed coxph
models -- a simple mice::pool
call does not work often. Not sure if that's relevant here though.
Thanks for looking into this -- the workaround works on my machine too.
One follow-up issue I've encountered in my workflow comes from not running models through the with()
function, but rather running them manually on each imputed dataset (i.e., iterating on a list of imputed datasets). I do this out of necessity because I cannot successfully run my real model using the with()
function (it keeps telling me variables are missing when they aren't).
More specifically, when I run models by supplying a formula object rather than explicitly specifying the formula in code, emmeans
says that it is unable to reconstruct the data
. My workflow requires that I supply the formula as a formula object rather than hard coding it because the steps before create a formula string based on user-supplied options.
Here is an updated reprex in which the imputation object is converted to a list of datasets, a model is fit on each, and then they are converted into a mira
object to prep for emmeans
:
# Create raw dataset
dat <- survival::lung
dat[seq(1, nrow(dat), by = 10), "age"] <- as.numeric(NA)
dat$sex <- factor(dat$sex)
# dat$time <- as.numeric(dat$time)
dat <- dat[, c("status", "time", "age", "sex")]
# Impute missing data points
imp <- mice::mice(dat, method = "cart", printFlag = FALSE)
# Create formula for each model
.formula <- as.formula("survival::Surv(time = time, event = status) ~ age + survival::strata(sex)")
# Run cox regression on each imputed dataset in list format
mods <- lapply(miceadds::mids2datlist(imp), function(.x){
survival::coxph(
formula = .formula,
data = .x)
})
recover_data.coxph = function(object, ...){
emmeans:::recover_data.survreg(object, ...)
}
# Try to get marginal means
emmeans::emmeans(mice::as.mira(mods), "sex")
#> Error in (function (object, at, cov.reduce = mean, cov.keep = get_emm_option("cov.keep"), : We are unable to reconstruct the data.
#> The variables needed are:
#> age sex
#> Are any of these actually constants? (specify via 'params = ')
#> The dataset name is:
#> .x
#> Does the data still exist? Or you can specify a dataset via 'data = '
sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] styler_1.7.0 zoo_1.8-10 tidyselect_1.1.1 xfun_0.23
#> [5] purrr_0.3.4 mitools_2.4 splines_4.0.4 lattice_0.20-41
#> [9] vctrs_0.5.1 generics_0.1.0 htmltools_0.5.4 yaml_2.3.5
#> [13] utf8_1.2.2 survival_3.4-0 rlang_1.0.6 R.oo_1.25.0
#> [17] pillar_1.7.0 glue_1.6.2 withr_2.5.0 DBI_1.1.1
#> [21] R.utils_2.12.0 multcomp_1.4-20 emmeans_1.7.3 R.cache_0.16.0
#> [25] lifecycle_1.0.3 stringr_1.4.0 R.methodsS3_1.8.2 mvtnorm_1.1-3
#> [29] codetools_0.2-18 coda_0.19-4 evaluate_0.14 knitr_1.33
#> [33] fastmap_1.1.0 miceadds_3.12-26 fansi_0.5.0 highr_0.9
#> [37] TH.data_1.1-1 broom_0.7.6 Rcpp_1.0.8.3 xtable_1.8-4
#> [41] backports_1.2.1 fs_1.5.2 digest_0.6.27 stringi_1.7.6
#> [45] dplyr_1.0.6 grid_4.0.4 cli_3.4.1 tools_4.0.4
#> [49] sandwich_3.0-2 magrittr_2.0.1 tibble_3.1.2 mice_3.15.0
#> [53] crayon_1.5.1 tidyr_1.1.3 pkgconfig_2.0.3 MASS_7.3-53
#> [57] ellipsis_0.3.2 Matrix_1.5-1 estimability_1.3 reprex_2.0.0
#> [61] assertthat_0.2.1 rmarkdown_2.8 rstudioapi_0.13 R6_2.5.1
#> [65] rpart_4.1-15 compiler_4.0.4
I'm pretty sure the problem here is that formulas carry an environment with them, and your constructed formula has the wrong environment from the one where the data exist.
And I don't think this can work correctly anyway, because the same symbol .x
is reused for each different dataset. We need to be able to reference each imputed dataset and recover each one correctly.
FWIW, the manual or list-based approach does work when I specify the formula in the coxph
function call and retain the .x
data reference (see reprex below). Is there any way you can think to complete this without the with()
function?
# Create raw dataset
dat <- survival::lung
dat[seq(1, nrow(dat), by = 10), "age"] <- as.numeric(NA)
dat$sex <- factor(dat$sex)
# dat$time <- as.numeric(dat$time)
dat <- dat[, c("status", "time", "age", "sex")]
# Impute missing data points
imp <- mice::mice(dat, method = "cart", printFlag = FALSE)
# Run cox regression on each imputed dataset in list format
mods <- lapply(miceadds::mids2datlist(imp), function(.x){
survival::coxph(
formula = survival::Surv(time = time, event = status) ~ age + survival::strata(sex),
data = .x)
})
recover_data.coxph = function(object, ...){
emmeans:::recover_data.survreg(object, ...)
}
# Try to get marginal means
emmeans::emmeans(mice::as.mira(mods), "sex")
#> sex emmean SE df asymp.LCL asymp.UCL
#> 1 1.140 0.571 Inf 0.0216 2.26
#> 2 0.631 0.604 Inf -0.5541 1.82
#>
#> Results are given on the log (not the response) scale.
#> Confidence level used: 0.95
sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] styler_1.7.0 zoo_1.8-10 tidyselect_1.1.1 xfun_0.23
#> [5] purrr_0.3.4 mitools_2.4 splines_4.0.4 lattice_0.20-41
#> [9] vctrs_0.5.1 generics_0.1.0 htmltools_0.5.4 yaml_2.3.5
#> [13] utf8_1.2.2 survival_3.4-0 rlang_1.0.6 R.oo_1.25.0
#> [17] pillar_1.7.0 glue_1.6.2 withr_2.5.0 DBI_1.1.1
#> [21] R.utils_2.12.0 multcomp_1.4-20 emmeans_1.7.3 R.cache_0.16.0
#> [25] lifecycle_1.0.3 stringr_1.4.0 R.methodsS3_1.8.2 mvtnorm_1.1-3
#> [29] codetools_0.2-18 coda_0.19-4 evaluate_0.14 knitr_1.33
#> [33] fastmap_1.1.0 miceadds_3.12-26 fansi_0.5.0 highr_0.9
#> [37] TH.data_1.1-1 broom_0.7.6 Rcpp_1.0.8.3 xtable_1.8-4
#> [41] backports_1.2.1 fs_1.5.2 digest_0.6.27 stringi_1.7.6
#> [45] dplyr_1.0.6 grid_4.0.4 cli_3.4.1 tools_4.0.4
#> [49] sandwich_3.0-2 magrittr_2.0.1 tibble_3.1.2 mice_3.15.0
#> [53] crayon_1.5.1 tidyr_1.1.3 pkgconfig_2.0.3 MASS_7.3-53
#> [57] ellipsis_0.3.2 Matrix_1.5-1 estimability_1.3 reprex_2.0.0
#> [61] assertthat_0.2.1 rmarkdown_2.8 rstudioapi_0.13 R6_2.5.1
#> [65] rpart_4.1-15 compiler_4.0.4
I am lost. The code you show just above that runs successfully does not involve the with()
function -- so I'd say that you found a way to do it without use of with()
.
It could well be that emmeans
is retrieving the same data with each model instance, even though the models are based on different datasets. And that might even be OK, in that the retrieved data serves as a reference for getting factor levels, etc.
Also, as I said before, environment(formula)
is important for retrieving the data. I think it should be an environment where .x.
can be found.
This issue has morphed a couple of times into something else, and I don't know where this is going. I have had almost zero experience with the mice package and I am not sure what issue we are addressing anymore.
Is there any way you can think to complete this without the
with()
function?
This question was poorly worded. I meant is there any way to avoid the use of with()
AND still use a stored formula like the reprex before it tried to do but failed:
# Create formula for each model .formula <- as.formula("survival::Surv(time = time, event = status) ~ age + survival::strata(sex)") # Run cox regression on each imputed dataset in list format mods <- lapply(miceadds::mids2datlist(imp), function(.x){ survival::coxph( formula = .formula, data = .x) }) recover_data.coxph = function(object, ...){ emmeans:::recover_data.survreg(object, ...) } # Try to get marginal means emmeans::emmeans(mice::as.mira(mods), "sex") #> Error in (function (object, at, cov.reduce = mean, cov.keep = get_emm_option("cov.keep"), : We are unable to reconstruct the data. #> The variables needed are: #> age sex #> Are any of these actually constants? (specify via 'params = ') #> The dataset name is: #> .x #> Does the data still exist? Or you can specify a dataset via 'data = '
Based on your comments, it seems like I might just need to learn more about how formulae specify and/or carry environments in order to make this approach work.
Meanwhile, I have modified the method dispatch code so now it works without specifying a local method. It'll be in the next push.
I think this may work:
mods <- lapply(miceadds::mids2datlist(imp), function(.x) {
environment(.formula) <- environment()
survival::coxph(
formula = .formula,
data = .x)
})
That worked, thanks!
I'm glad. And it suggests that you do get the different datasets. I did a small test:
> tst = lapply(1:4, function(x) environment())
> tst
[[1]]
<environment: 0x00000144513d8fd0>
[[2]]
<environment: 0x00000144513d8c88>
[[3]]
<environment: 0x00000144514cbe08>
[[4]]
<environment: 0x00000144514cbba0>
> tst[[3]] $ x
[1] 3
This confirms that lapply()
creates different environments, and the value of x
is different in each.
I think this is completed, so am closing.
Description
After multiple imputation with the
mice
package, theemmeans
function is not working on themira
object ofsurvival::coxph
models (i.e., it produces the error shown below:no applicable method for 'recover_data' applied to an object of class "coxph"
). Bothmira
objects andcoxph
models are compatible withemmeans
, but maybe not the combination of the two?Reprex