rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
341 stars 30 forks source link

The combination of emmeans + lm + mice works, but gls + mice - does not. #406

Closed Generalized closed 11 months ago

Generalized commented 1 year ago

Dear Professor,

In one of the GitHub closed issues I found that emmeans supports the "mice" imputation framework. I also know it supports the gls and mmrm packages.

I just tried to confirm it a simple example. (Please don't mind the nonsensical use of the gls without any correlation or weights. Just wanted to obtain class "gls", rather than meaningful results)

library(mice)
library(nlme)
library(broom.mixed)  # for gls and mice
library(parameters)

nhanes$hyp <- factor(nhanes$hyp)

# imputation
nhanes_miss <- mice(nhanes, m = 5, method = "pmm", seed = 123)

OK, before we start, let's just make sure the emmeans still talks with nlme::gls()

m_gls <- gls(bmi ~ hyp + chl, nhanes[complete.cases(nhanes),])  # we will use complete data, only to check
emmeans(m_gls, specs = ~hyp+chl)

hyp chl emmean   SE df lower.CL upper.CL
 1   192   26.7 1.51 10     23.3     30.1
 2   192   26.0 2.91 10     19.5     32.5

Degrees-of-freedom method: df.error 
Confidence level used: 0.95 

Good! Now, let's run the gls over the imputed datasets and pool the results:

imp_m_gls <- with(nhanes_miss, gls(bmi ~ hyp + chl))
gls_pooled <- pool(imp_m_gls)

> gls_pooled
Class: mipo    m = 5 
         term m    estimate         ubar            b            t dfcom       df       riv    lambda       fmi
1 (Intercept) 5 20.93669210 1.687610e+01 8.1516575959 2.665809e+01    22 8.951994 0.5796356 0.3669426 0.4728759
2        hyp2 5  0.14957330 4.935426e+00 3.3808636266 8.992463e+00    22 7.096871 0.8220235 0.4511596 0.5598746
3         chl 5  0.02897774 4.680415e-04 0.0002824508 8.069824e-04    22 7.734648 0.7241686 0.4200103 0.5280697

also verify:
model_parameters(gls_pooled)
> model_parameters(gls_pooled)
# Fixed Effects

Parameter   | Coefficient |   SE |         95% CI |    t |   df |     p
-----------------------------------------------------------------------
(Intercept) |       20.94 | 5.16 | [ 9.25, 32.63] | 4.06 | 8.95 | 0.003
hyp2        |        0.15 | 3.00 | [-6.92,  7.22] | 0.05 | 7.10 | 0.962
chl         |        0.03 | 0.03 | [-0.04,  0.09] | 1.02 | 7.73 | 0.339

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.
Warning messages:
1: Could not recover model data from environment. Please make sure your data is available in your workspace.
  Trying to retrieve data from the model frame now. 
2: Could not get model data. 

OK, let's use the emmeans:

emmeans(imp_m_gls, specs = ~hyp+chl)
Error in UseMethod("recover_data") : 
  no applicable method for 'recover_data' applied to an object of class "gls"
Error in (function (object, at, cov.reduce = mean, cov.keep = get_emm_option("cov.keep"),  : 
  Perhaps a 'data' or 'params' argument is needed

Hmm, let's chek the same with ordinary lm():

imp_m_lm <- with(nhanes_miss, lm(bmi ~ hyp + chl))
lm_pooled <- pool(imp_m_lm)
emmeans(imp_m_lm, specs = ~hyp+chl)

 hyp chl emmean   SE df lower.CL upper.CL
 1   191   26.5 1.16 22     24.1     28.9
 2   191   26.6 2.41 22     21.6     31.6

Confidence level used: 0.95 

So it works for the lm() but not gls().

Is there any hope to make it working together?

rvlenth commented 1 year ago

It's a complicated situation. First, emmeans() and relatives require being able to recover the dataset, but that doesn't seem possible without help, at least for this model:

> recover_data(imp_m_gls)
Error in recover_data.call(fcall, trms, object$na.action, data = data, :
argument "data" is missing, with no default

> recover_data(imp_m_gls, data = nhanes)
   hyp chl
2    1 187
3    1 187
5    1 113
... (output abbreviated) ...

Correspondingly, we can try:

> ref_grid(imp_m_gls)
Error in ref_grid(imp_m_gls) : We are unable to reconstruct the data.
The variables needed are:
    hyp chl
Are any of these actually constants? (specify via 'params = ')
 Try re-running with 'data = "<name of dataset>"'

### Then follow up with what is suggested (which you could have tried...)
> ref_grid(imp_m_gls, data = nhanes)
Error in eval(predvars, data, env) : object 'hyp' not found

We still get an error because for gls models, we also use the data in the emm_basis.gls() method. There is a mechanism for that in recover_data.gls(), but for imp_m_gls, we are using recover_data.mira() instead. I think I found a workaround; at least it works with this example:

> emmeans(imp_m_gls, specs = ~hyp+chl, data = nhanes)
 hyp chl emmean   SE df lower.CL upper.CL
 1   192   26.5 1.17 22     24.1     28.9
 2   192   26.6 2.40 22     21.7     31.6

Degrees-of-freedom method: df.error 
Confidence level used: 0.95 

Note you do have to specify the data argument here.

Generalized commented 1 year ago

Thank you very much.

Alas, I cannot reproduce this result:

> imp_m_gls
call :
with.mids(data = nhanes_miss, expr = gls(bmi ~ hyp + chl))

call1 :
mice(data = nhanes, m = 5, method = "pmm", seed = 123)

nmis :
age bmi hyp chl  id tim 
  0   9   8  10   0   0 

analyses :
[[1]]
Generalized least squares fit by REML
  Model: bmi ~ hyp + chl 
  Data: NULL 
  Log-restricted-likelihood: -67.83865

Coefficients:
(Intercept)        hyp2         chl 
18.27884063  0.90312733  0.03547781 

Degrees of freedom: 25 total; 22 residual
Residual standard error: 3.796034 

[[2]]
Generalized least squares fit by REML
  Model: bmi ~ hyp + chl 
  Data: NULL 
  Log-restricted-likelihood: -68.21973

Coefficients:
(Intercept)        hyp2         chl 
18.49466702  0.80839826  0.03368811 

Degrees of freedom: 25 total; 22 residual
Residual standard error: 3.86554 

[[3]]
Generalized least squares fit by REML
  Model: bmi ~ hyp + chl 
  Data: NULL 
  Log-restricted-likelihood: -68.83631

Coefficients:
(Intercept)        hyp2         chl 
 23.3494844   0.3539216   0.0140681 

Degrees of freedom: 25 total; 22 residual
Residual standard error: 3.922864 

[[4]]
Generalized least squares fit by REML
  Model: bmi ~ hyp + chl 
  Data: NULL 
  Log-restricted-likelihood: -69.15577

Coefficients:
(Intercept)        hyp2         chl 
18.32612582 -0.87559308  0.03520585 

Degrees of freedom: 25 total; 22 residual
Residual standard error: 4.004991 

[[5]]
Generalized least squares fit by REML
  Model: bmi ~ hyp + chl 
  Data: NULL 
  Log-restricted-likelihood: -71.61168

Coefficients:
(Intercept)        hyp2         chl 
20.73375882 -2.96109405  0.04188712 

Degrees of freedom: 25 total; 22 residual
Residual standard error: 4.506599 

and

> pool(imp_m_gls)
Class: mipo    m = 5 
         term m   estimate         ubar            b            t dfcom        df       riv    lambda       fmi
1 (Intercept) 5 19.8365753 2.081502e+01 4.9135453308 2.671128e+01    22 13.230280 0.2832692 0.2207403 0.3167657
2        hyp2 5 -0.3542480 5.744374e+00 2.6254225741 8.894882e+00    22  9.270615 0.5484509 0.3541933 0.4594540
3         chl 5  0.0320654 6.100493e-04 0.0001111273 7.434021e-04    22 14.651663 0.2185934 0.1793817 0.2723609

> summary(pool(imp_m_gls))
         term   estimate std.error  statistic        df     p.value
1 (Intercept) 19.8365753 5.1682954  3.8381272 13.230280 0.001992168
2        hyp2 -0.3542480 2.9824288 -0.1187784  9.270615 0.907985021
3         chl  0.0320654 0.0272654  1.1760471 14.651663 0.258331829

then

> emmeans(imp_m_gls, specs = ~hyp+chl, data = nhanes)
Error in eval(predvars, data, env) : object 'hyp' not found

This works:

recover_data(imp_m_gls, data = nhanes)
> recover_data(imp_m_gls, data = nhanes)
   hyp chl
2    1 187
3    1 187
5    1 113
7    1 118
8    1 187
9    1 238
13   1 206
14   2 204
17   2 284
18   2 199
19   1 218
22   1 229
23   1 131
25   1 186
Generalized commented 1 year ago

I also checked the geeglm and it worked:

> imp_m_gee <- with(nhanes_miss, geepack::geeglm(bmi ~ hyp + chl, corstr = "exchangeable", id = nhanes$id))

> (gee_pooled <- pool(imp_m_gee))
Class: mipo    m = 5 
         term m   estimate         ubar            b            t dfcom       df       riv    lambda       fmi
1 (Intercept) 5 19.8365753 1.750901e+01 4.9135453308 2.340526e+01    22 12.20835 0.3367555 0.2519200 0.3502976
2        hyp2 5 -0.3542480 3.706792e+00 2.6254225741 6.857299e+00    22  6.93623 0.8499281 0.4594385 0.5682446
3         chl 5  0.0320654 6.292612e-04 0.0001111273 7.626139e-04    22 14.81005 0.2119196 0.1748627 0.2675225

> summary(pool(imp_m_gee))
         term   estimate  std.error  statistic       df    p.value
1 (Intercept) 19.8365753 4.83789860  4.1002462 12.20835 0.00142124
2        hyp2 -0.3542480 2.61864457 -0.1352791  6.93623 0.89623320
3         chl  0.0320654 0.02761547  1.1611391 14.81005 0.26395911

Based on the mice datasets:

> emmeans(imp_m_gee, specs = ~hyp+chl)
 hyp chl emmean   SE  df asymp.LCL asymp.UCL
 1   187   25.8 2.03 Inf      21.8      29.8
 2   187   25.5 1.56 Inf      22.4      28.5

Covariance estimate used: vbeta 
Confidence level used: 0.95 

but when the "data" parameter is added, it takes it from the original data:

> emmeans(imp_m_gee, specs = ~hyp+chl, data=nhanes)
 hyp chl emmean   SE  df asymp.LCL asymp.UCL
 1   192   26.0 2.08 Inf      21.9      30.1
 2   192   25.6 1.47 Inf      22.8      28.5

Covariance estimate used: vbeta 
Confidence level used: 0.95 

so I'm worrying if the trick with gls() wouldn't result in the same issue?

rvlenth commented 1 year ago

I see that would be the case. But I don't see how this can be prevented when it is necessary to specifydata to make it work. You can specify a different dataset with the same variables, but no matter what, it will use the same one dataset. FWIW, if it were able to reconstruct the data, what it would do is create an "average" dataset.

But I would also comment that this can be worked around. Specifying data affects two things: The levels of factors, and the default average values of covariates. The former is not a problem, The latter can be addressed by specifying at = list(chl = 187) in the call.

rvlenth commented 1 year ago

Thinking about this fractionally longer, I suggest that the right thing to do would be to use rbind() to stack all the imputed datasets into one, and specify that in the data argument.

Generalized commented 1 year ago

I found a way to work with "mmrm", so this might temporarily replace nlme::gls()

As I mentioned elsewhere, mmrm doesn't work with "with()".

> imp_m_mmrm <- with(nhanes_miss, mmrm(bmi ~ hyp + chl + us(tim|id)))
Error in attr(data, "dataname") <- toString(match.call()$data) : 
  cannot set attribute on a symbol

So the workaround was to make a list of the datasets and iterate through lapply. This, however, gives a list rather than "mira". But this can be casted as "mira".

> dat_mice <- complete(nhanes_miss, "all")

> fit_mmrm <- function(dat) {
  mod <- mmrm(bmi ~ hyp + chl + us(tim|id),data=dat)
  return(mod)
}

> imp_m_mmrm <- lapply(dat_mice, fit_mmrm)
> summary(pool(imp_m_mmrm))
         term   estimate std.error  statistic        df     p.value
1 (Intercept) 19.8365753 5.1682953  3.8381273 13.230279 0.001992168
2        hyp2 -0.3542480 2.9824287 -0.1187784  9.270615 0.907985019
3         chl  0.0320654 0.0272654  1.1760471 14.651663 0.258331811

> emmeans(as.mira(imp_m_mmrm), specs = ~hyp+chl)
 hyp chl emmean   SE df lower.CL upper.CL
 1   187   25.8 1.98 22     21.7     29.9
 2   187   25.5 2.35 22     20.6     30.3

Confidence level used: 0.95 

and now I can also test the contrasts:

# For the original mmrm using only *complete cases* (13 - 3 parameters -> df =10)
> emmeans::contrast(emmeans(m_mmrm, specs = ~hyp+chl), list(some_contrast = c(1, -1)))
 contrast      estimate  SE df t.ratio p.value
 some_contrast    0.685 3.4 10   0.202  0.8443

# For the imputed datasets (25 - 3 -> df=22)
> emmeans::contrast(emmeans(as.mira(imp_m_mmrm), specs = ~hyp+chl), list(some_contrast = c(1, -1)))
 contrast      estimate   SE df t.ratio p.value
 some_contrast    0.354 2.98 22   0.119  0.9065
rvlenth commented 11 months ago

I think I will close this issue, as it's not clear that more can be done.