rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
340 stars 30 forks source link

What is the calculation of df (degrees of freedom), in the results of `emmeans` for `mira` class of `mice` package #494

Closed kaigu1990 closed 1 month ago

kaigu1990 commented 1 month ago

Thanks for providing such helpful functions for mira class from mice package. When I use the emmeans() function to compute the pooled LS-means, I'm a bit confused about the df column in the results. Let me show you the code below.

low1 <- haven::read_sas("./low1.sas7bdat")
low_wide <- low1 %>%
  tidyr::pivot_wider(
    names_from = week,
    names_prefix = "week",
    values_from = change
  ) %>%
  select(-POOLINV) %>%
  mutate(trt = factor(trt, levels = c("1", "2")))

# create the imputed datasets
low_imp <- mice(low_wide, method = "norm.predict", seed = 12306)
# fit the ancova models
mods <- with(
  low_imp,
  lm(week8 ~ basval + trt)
)
# pool the imputed datasets
pool(mods)
## Class: mipo    m = 5 
##          term m   estimate        ubar            b           t dfcom       df         riv
## 1 (Intercept) 5  0.6767737 2.982354968 6.679090e-03 2.990369876   197 194.4394 0.002687443
## 2      basval 5 -0.5354029 0.006510448 1.426767e-05 0.006527569   197 194.4534 0.002629804
## 3        trt2 5 -1.8374286 0.442824415 3.722857e-04 0.443271157   197 194.8238 0.001008849
##        lambda        fmi
## 1 0.002680240 0.01278278
## 2 0.002622906 0.01272531
## 3 0.001007832 0.01110765

We can observe that the df in the pooled results is 194.8238 for trt2 term.

However, when I use the emmeans() function to compute the ls-means for trt variable, the default df becomes to 197 that seems the residual df (dfcom column) in the pooled results from mice.

emmeans::emmeans(mods, ~trt)
##  trt emmean   SE  df lower.CL upper.CL
##  1    -10.6 0.47 197    -11.5    -9.63
##  2    -12.4 0.47 197    -13.3   -11.46
## 
## Confidence level used: 0.95 

As I can not find any information from emmeans document for explaining this pooling process. Could you help clarify which df is used to compute the p-value and why that would be a better choice?

Thanks a lot!.

rvlenth commented 1 month ago

One problem with developing a package that supports a lot of different models is that I have little expertise on some of those models; and multiple imputation is such an example. A glimpse of what is done is described in the "models" vignette. Looking at the code. I think what happens is that the d.f. come from the first of the analyses included. Perhaps you can verify that by rearranging the analyses and see if it makes a difference. You have provided an illustration of a case where different dfs happen, but it's not an example because I don't have the data.

I would comment that there is no practical difference between 194.8 df and 197 df in terms of P values or confidence limits. Note that the .975 quantiles agree to 4 digits:

> qt(.975, df = c(194.8, 197))
[1] 1.972217 1.972079

Outputting d.f. to 4 decimal places is way, way more precision than is necessary. So at least in this illustration, this is not an important issue. My sense also is that with MI methods, all the analyses are about the same, with somewhat different imputed datasets. This would suggest that the degrees of freedom are about the same as well, so that it doesn't matter very much. Am I incorrect? Are there cases where the d.f. are notably discrepant between the analyses being combined? Do you know of established d.f. methods that should be used?

rvlenth commented 1 month ago

PS I guess I do see in the help for pool() a lit reference to Barnard, J. and Rubin, D.B. (1999). Small sample degrees of freedom with multiple imputation. Biometrika, 86, 948-955. It's a bit complicated that each regression coefficient gets different d.f. I suppose that Satterthwaite's formula can be used to get d.f. for a linear function. But the question is whether it's worth doing this.

jpiaskowski commented 1 month ago

This might be a question more suited to the {mice} package authors/maintainers. A look at the MICE documentation for pool():

dfcom: A positive number representing the degrees of freedom in the complete-data analysis. Normally, this would be the number of independent observation minus the number of fitted parameters. The default (dfcom = NULL) extract this information in the following order: 1) the component residual.df returned by glance() if a glance() function is found, 2) the result of df.residual (applied to the first fitted model, and 3) as 999999. In the last case, the warning "Large sample assumed" is printed. If the degrees of freedom is incorrect, specify the appropriate value manually.

The item 'df' is a product of the object class 'mipo', which is not clearly explained in mice documentation. The explanation above suggests that 'df' is for something less than a 'complete-data analysis', although I cannot be sure.

I believe that {emmeans} is extracting 'dfcom' when called, which is why those match between the two object types.

rvlenth commented 1 month ago

Apparently, that Barnard & Rubin (1999) article is widely cited, suggesting this is important to a lot of people. So there's that.

I looked at that article as well as the code in the mice package, and I think I have an implementation of the Barnard-Rubin method. It's based on a formula involving the (adjusted) variance t of the estimate itself and the variance among the estimates, b. (These are two of the quantities that show up in the pool output.) So what I do is save the two associated covariance matrices and compute the quadratic forms for a given set of coefficients, Most of the code for that is in the function mice:::barnard.rubin and I just adapted that. To make coding easier, though, I just use the first analysis to obtain the uncorrected df that they label dfcom.

Here's an example using a standard dataset.

library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

imp = mice(nhanes, method = "norm.predict", seed = 2024.0527)
##     (chatter deleted)

mods1 = with(imp, lm(bmi ~ hyp))
pool(mods1)
## Class: mipo    m = 5 
##          term m  estimate     ubar           b        t dfcom       df
## 1 (Intercept) 5 27.806537 7.537912 0.016643167 7.557883    23 21.17388
## 2         hyp 5 -1.047378 4.580027 0.004937166 4.585952    23 21.20315
##           riv      lambda        fmi
## 1 0.002649514 0.002642512 0.08515780
## 2 0.001293573 0.001291902 0.08381901

emmeans::emmeans(mods1, "hyp", at = list(hyp = 1:2))
##  hyp emmean    SE   df lower.CL upper.CL
##    1   26.8 0.929 20.8     24.8     28.7
##    2   25.7 1.827 21.2     21.9     29.5
## 
## Confidence level used: 0.95
mods2 = with(imp, lm(bmi ~ age + hyp))
pool(mods2)
## Class: mipo    m = 5 
##          term m  estimate      ubar            b         t dfcom       df
## 1 (Intercept) 5 28.513466 5.2295925 0.0232225559 5.2574595    22 20.12987
## 2         age 5 -3.284127 0.9441445 0.0055452987 0.9507989    22 20.09340
## 3         hyp 5  3.079579 4.6420112 0.0003029873 4.6423748    22 20.23797
##            riv       lambda        fmi
## 1 5.328726e-03 5.300481e-03 0.09131043
## 2 7.048030e-03 6.998702e-03 0.09299741
## 3 7.832483e-05 7.831869e-05 0.08613760

emmeans::emmeans(mods2, "hyp", at = list(hyp = 1:2))
##  hyp emmean    SE   df lower.CL upper.CL
##    1   25.8 0.819 20.0     24.1     27.5
##    2   28.9 1.785 20.2     25.2     32.6
## 
## Confidence level used: 0.95

Created on 2024-05-27 with reprex v2.1.0

I'll push this to GitHub shortly.

If you're curious, here's the gory details for this example

> RG = ref_grid(mods1)
> RG@dffun    # How we compute d.f. for a linear combination a of reg coefs
function(a, dfargs) { # pretty much copied from mice:::barnard.rubin
        dfcom = dfargs$df1(a, dfargs)
        with(dfargs, { 
             b = sum(a * (B %*% a))
             t = sum(a * (T %*% a))
             lambda = (1 + 1/m) * b / t
             dfold = (m - 1)/lambda*2
             dfobs = (dfcom + 1)/(dfcom + 3) * dfcom * (1 - lambda)
             ifelse(is.infinite(dfcom), dfold, 
                    dfold * dfobs/(dfold + dfobs))
        }) }
<environment: base>

> RG@dfargs   # parameters used
$df
[1] 23

$m
[1] 5

$df1
function(k, dfargs) dfargs$df
<bytecode: 0x00000259c9a26ea8>
<environment: 0x0000025a4f9b3bf0>

$B
             [,1]         [,2]
[1,]  0.016643167 -0.008833903
[2,] -0.008833903  0.004937166

$T
            (Intercept)       hyp
(Intercept)    7.557883 -5.640687
hyp           -5.640687  4.585952
rvlenth commented 1 month ago

Oops, I found a typo in the code. I had dfold = (m - 1)/lambda*2 instead of dfold = (m - 1)/lambda^2.

Here are the correct results:

emmeans::emmeans(mods1, "hyp", at = list(hyp = 1:2))
##  hyp emmean    SE   df lower.CL upper.CL
##    1   26.8 0.929 21.1     24.8     28.7
##    2   25.7 1.827 21.2     21.9     29.5
## 
## Confidence level used: 0.95

emmeans::emmeans(mods2, "hyp", at = list(hyp = 1:2))
##  hyp emmean    SE   df lower.CL upper.CL
##    1   25.8 0.819 20.2     24.1     27.5
##    2   28.9 1.785 20.2     25.2     32.6
## 
## Confidence level used: 0.95
kaigu1990 commented 1 month ago

@rvlenth @jpiaskowski Thank you for your response so promptly. You're right, the df calculation for the pooling process uses the the Barnard-Rubin adjustment for small samples (Barnard and Rubin, 1999) that mentioned in the pool() documents. I think it's good for keeping the same page with mice package in the handing the df calculation.

kaigu1990 commented 1 month ago

Oops, I found a typo in the code. I had dfold = (m - 1)/lambda*2 instead of dfold = (m - 1)/lambda^2.

Here are the correct results:

emmeans::emmeans(mods1, "hyp", at = list(hyp = 1:2))
##  hyp emmean    SE   df lower.CL upper.CL
##    1   26.8 0.929 21.1     24.8     28.7
##    2   25.7 1.827 21.2     21.9     29.5
## 
## Confidence level used: 0.95

emmeans::emmeans(mods2, "hyp", at = list(hyp = 1:2))
##  hyp emmean    SE   df lower.CL upper.CL
##    1   25.8 0.819 20.2     24.1     27.5
##    2   28.9 1.785 20.2     25.2     32.6
## 
## Confidence level used: 0.95

Hi @rvlenth , can I re-run the updated emmeans::emmeans() function right now? I installed the emmeans from development version and would like to try it verify it again.

rvlenth commented 1 month ago

I just pushed up the corrected code. You can install emmeans from GitHub - follow the instructions on the Code page.

kaigu1990 commented 1 month ago

I have tried it in my datasets, and it looks great. Thank you for your prompt response.

rvlenth commented 1 month ago

I'm closing this issue as it seems to be resolved