openpharma / mmrm

Mixed Models for Repeated Measures (MMRM) in R.
https://openpharma.github.io/mmrm/

add tidy() and glance() methods #247

Closed: Generalized closed this issue 1 year ago

Generalized commented 1 year ago

Dear Authors,

I work on a fairly complex longitudinal study with a compound binary endpoint calculated from a number of components (both numeric, though not all normally distributed, and categorical). Each of these components will also be analysed separately, in various ways (between-arm, within-arm, single-timepoint, MMRM).

I expect about 10-20% of intermittent and monotone gaps in the data. The entire incomplete analysis dataset will be imputed via mice in a chained way, variable by variable, mostly using 2-level PMM (predictive mean matching), and will serve as the input for all subsequent analyses.
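
To give a flavour of what I mean by the chained, variable-by-variable setup, here is a minimal sketch on mice's built-in nhanes2 data (my real study uses a 2-level PMM method from an add-on package and many more variables, so the data and method strings below are only placeholders, not my actual setup):

library(mice)

# One imputation method per incomplete variable; PMM is the default for numeric ones
meth <- make.method(nhanes2)
meth[c("bmi", "chl")] <- "pmm"            # the 2-level PMM variant would be plugged in here
# Predictor matrix: which variables are used to impute which
pred <- make.predictorMatrix(nhanes2)
imp  <- mice(nhanes2, method = meth, predictorMatrix = pred, m = 20, seed = 2023)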

With the dataset complete (imputed), I can run all the necessary analyses on it in a consistent manner, whether they require complete data or not (as with MMRM). This is a must-have requirement raised by the statistical reviewer, who asked for cross-analysis consistency, which simply means that all analyses need to be performed on the same, imputed dataset. The reviewer also wants me to compare the results from MMRM itself, PMM-MI-GEE and PMM-MI-MMRM.

OK, so far so good.

For most analyses I have the pooling tools. mice currently utilizes the broom "tidiers". The tidiers unify the way in which model components (estimates, variance-covariance) can be accessed; they create an "interface" between each kind of supported model and the mice functions.
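
To make the "interface" idea concrete, a minimal sketch of what the tidier contract looks like (illustrative only, using lm() because it already has tidiers):

library(broom)

fit <- lm(mpg ~ wt, data = mtcars)
tidy(fit)    # one row per coefficient: term, estimate, std.error, statistic, p.value
glance(fit)  # one row per model: r.squared, AIC, BIC, logLik, df.residual, ...

This per-term / per-model tibble layout is what mice's pooling relies on.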

Currently there are tidiers for many methods, including nlme::gls() in broom.mixed. I can also use the parameters package, which cooperates smoothly with "mira" objects.

But I noticed there is no support for mmrm in either broom.mixed or parameters. Would you maybe consider adding support for these? Yes, I can write the broom tidiers (glance, tidy) on my own, but it would be super helpful to have them consistent and "on board".

Adding support for parameters and/or broom.mixed would facilitate the use of mmrm in many (not necessarily the most common, but definitely possible) scenarios!

/ PS: no, rbmi is not of interest here; that's a different story. It has been agreed to use PMM from mice in a chained way. /

danielinteractive commented 1 year ago

Thanks @Generalized, good idea. We can consider that. If you happen to have any code that you could share as a potential starting point, that would be helpful and would speed this up. I think given the current backlog this won't happen in the next few months.

Generalized commented 1 year ago

So far I have tried these naive implementations, adapted from the broom.mixed:::*.gls methods:

glance.mmrm <- function(x) {
  # Naive adaptation of broom.mixed:::glance.gls(): pull the model-level
  # summaries out of the summary() object of an mmrm fit.
  ss <- summary(x)
  with(ss, tibble::tibble(
    sigma       = sqrt(varcor[1, 1]),      # residual SD from the covariance estimate
    df          = nrow(coefficients),      # number of fixed-effect estimates
    logLik      = aic_list[["logLik"]],
    AIC         = aic_list[["AIC"]],
    BIC         = aic_list[["BIC"]],
    df.residual = n_subjects - df          # naive: subjects minus fixed effects
  ))
}

tidy.mmrm <- function(x, conf.int = FALSE, conf.level = 0.95, ...) {
  # Naive adaptation of broom.mixed:::tidy.gls(): one row per fixed effect.
  # Requires tibble, dplyr and the magrittr pipe to be available.
  ret <- summary(x)$coefficients %>%
    as.data.frame() %>%
    tibble::rownames_to_column(var = "term") %>%
    tibble::as_tibble() %>%
    dplyr::rename(
      estimate  = Estimate,
      std.error = `Std. Error`,
      statistic = `t value`,
      p.value   = `Pr(>|t|)`
    )
  if (conf.int) {
    cc <- confint(x, level = conf.level) %>%
      as.data.frame() %>%
      setNames(c("conf.low", "conf.high"))
    ret <- dplyr::bind_cols(ret, cc)
  }
  ret
}
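
For these S3 methods to be found by glance(x) and tidy(x), the generics have to be loaded first, e.g.:

library(broom)   # provides the tidy() and glance() generics (re-exported from the generics package)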

but the biggest problem is that I cannot make mmrm work with mice :(

Let's use this naive scenario (totally nonsensical, but just to illustrate):

nhanes <- structure(list(age = c(1, 2, 1, 3, 1, 3, 1, 1, 2, 2, 1, 2, 3, 
2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 2), bmi = c(NA, 22.7, NA, NA, 
20.4, NA, 22.5, 30.1, 22, NA, NA, NA, 21.7, 28.7, 29.6, NA, 27.2, 
26.3, 35.3, 25.5, NA, 33.2, 27.5, 24.9, 27.4), hyp = structure(c(NA, 
1L, 1L, NA, 1L, NA, 1L, 1L, 1L, NA, NA, NA, 1L, 2L, 1L, NA, 2L, 
2L, 1L, 2L, NA, 1L, 1L, 1L, 1L), levels = c("1", "2"), class = "factor"), 
    chl = c(NA, 187, 187, NA, 113, 184, 118, 187, 238, NA, NA, 
    NA, 206, 204, NA, NA, 284, 199, 218, NA, NA, 229, 131, NA, 
    186), id = structure(1:25, levels = c("1", "2", "3", "4", 
    "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", 
    "16", "17", "18", "19", "20", "21", "22", "23", "24", "25"
    ), class = "factor"), tim = structure(c(1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), class = "factor", levels = "1")), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25"), class = "data.frame")

library(mice)

nhanes_miss <- mice(nhanes, m = 5, method = "pmm", seed = 123)

It works with nlme::gls():

> imp_m_gls <- with(nhanes_miss, gls(bmi ~ hyp + chl))
> (gls_pooled <- pool(imp_m_gls))

Class: mipo    m = 5 
         term m   estimate         ubar            b            t dfcom        df       riv    lambda       fmi
1 (Intercept) 5 19.8365753 2.081502e+01 4.9135453308 2.671128e+01    22 13.230280 0.2832692 0.2207403 0.3167657
2        hyp2 5 -0.3542480 5.744374e+00 2.6254225741 8.894882e+00    22  9.270615 0.5484509 0.3541933 0.4594540
3         chl 5  0.0320654 6.100493e-04 0.0001111273 7.434021e-04    22 14.651663 0.2185934 0.1793817 0.2723609

> model_parameters(gls_pooled)
# Fixed Effects

Parameter   | Coefficient |   SE |         95% CI |     t |    df |     p
-------------------------------------------------------------------------
(Intercept) |       19.84 | 5.17 | [ 8.69, 30.98] |  3.84 | 13.23 | 0.002
hyp2        |       -0.35 | 2.98 | [-7.07,  6.36] | -0.12 |  9.27 | 0.908
chl         |        0.03 | 0.03 | [-0.03,  0.09] |  1.18 | 14.65 | 0.258

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.
Warning messages:
1: Could not recover model data from environment. Please make sure your data is available in your workspace.
  Trying to retrieve data from the model frame now. 
2: Could not get model data. 

but NOT with mmrm:

> imp_m_mmrm <- with(nhanes_miss, mmrm(bmi ~ hyp + chl + us(tim|id)))
Error in attr(data, "dataname") <- toString(match.call()$data) : 
  cannot set attribute on a symbol

But we can apply a workaround:

dat_mice <- complete(nhanes_miss, "all")

fit_mmrm <- function(dat) {
  # Fit the same MMRM on one completed (imputed) dataset
  mmrm(bmi ~ hyp + chl + us(tim | id), data = dat)
}

imp_m_mmrm <- lapply(dat_mice, fit_mmrm)

> imp_m_mmrm
$`1`
mmrm fit

Formula:     bmi ~ hyp + chl + us(tim | id)
Data:        dat (used 25 observations from 25 subjects with maximum 1 timepoints)
Covariance:  unstructured (1 variance parameters)
Method:      REML
Deviance:    135.6773

Coefficients: 
(Intercept)        hyp2         chl 
18.27884063  0.90312733  0.03547781 

Model Inference Optimization:
Converged with code 0 and message: convergence: rel_reduction_of_f <= factr*epsmch

$`2`......
.....

The two sets of results can be pooled and they agree perfectly.

> pool(imp_m_mmrm)
Class: mipo    m = 5 
         term m   estimate         ubar            b            t dfcom        df       riv    lambda       fmi
1 (Intercept) 5 19.8365753 2.081502e+01 4.9135453308 2.671128e+01    22 13.230279 0.2832692 0.2207403 0.3167657
2        hyp2 5 -0.3542480 5.744374e+00 2.6254225741 8.894881e+00    22  9.270615 0.5484509 0.3541933 0.4594540
3         chl 5  0.0320654 6.100493e-04 0.0001111273 7.434021e-04    22 14.651663 0.2185934 0.1793818 0.2723609

> pool(imp_m_gls)
Class: mipo    m = 5 
         term m   estimate         ubar            b            t dfcom        df       riv    lambda       fmi
1 (Intercept) 5 19.8365753 2.081502e+01 4.9135453308 2.671128e+01    22 13.230280 0.2832692 0.2207403 0.3167657
2        hyp2 5 -0.3542480 5.744374e+00 2.6254225741 8.894882e+00    22  9.270615 0.5484509 0.3541933 0.4594540
3         chl 5  0.0320654 6.100493e-04 0.0001111273 7.434021e-04    22 14.651663 0.2185934 0.1793817 0.2723609

> summary(pool(imp_m_mmrm))
         term   estimate std.error  statistic        df     p.value
1 (Intercept) 19.8365753 5.1682953  3.8381273 13.230279 0.001992168
2        hyp2 -0.3542480 2.9824287 -0.1187784  9.270615 0.907985019
3         chl  0.0320654 0.0272654  1.1760471 14.651663 0.258331811

> summary(pool(imp_m_gls))
         term   estimate std.error  statistic        df     p.value
1 (Intercept) 19.8365753 5.1682954  3.8381272 13.230280 0.001992168
2        hyp2 -0.3542480 2.9824288 -0.1187784  9.270615 0.907985021
3         chl  0.0320654 0.0272654  1.1760471 14.651663 0.258331829

Let's also check glance() and tidy():

> m_mmrm <- mmrm(bmi ~ hyp + chl + us(tim|id),data= nhanes[complete.cases(nhanes),])
> m_gls  <- gls (bmi ~ hyp + chl, nhanes[complete.cases(nhanes),])

> glance(m_mmrm)
# A tibble: 1 × 6
  sigma    df logLik   AIC   BIC df.residual
  <dbl> <int>  <dbl> <dbl> <dbl>       <int>
1  4.66     3  -36.3  74.6  75.2          10
> glance(m_gls)
# A tibble: 1 × 6
  sigma    df logLik   AIC   BIC df.residual
  <dbl> <int>  <dbl> <dbl> <dbl>       <int>
1  4.66     3  -36.3  80.6  81.8          10

> tidy(m_mmrm)
# A tibble: 3 × 6
  term        estimate std.error    df statistic p.value
  <chr>          <dbl>     <dbl> <dbl>     <dbl>   <dbl>
1 (Intercept)  19.4       5.73    10.0     3.39  0.00689
2 hyp2         -0.685     3.40    10.0    -0.202 0.844  
3 chl           0.0378    0.0305  10.0     1.24  0.244  

> tidy(m_gls)
# A tibble: 3 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  19.4       5.73       3.39  0.00689
2 hyp2         -0.685     3.40      -0.202 0.844  
3 chl           0.0378    0.0305     1.24  0.244 

But the workaround for mmrm gives a plain list of fitted models rather than a proper mira object, so it won't work with emmeans:

> emmeans(imp_m_mmrm, specs = ~hyp+chl, data = nhanes)
Error in (function (object, at, cov.reduce = mean, cov.keep = get_emm_option("cov.keep"),  : 
  Can't handle an object of class  “list” 
 Use help("models", package = "emmeans") for information on supported models.

It would be good to allow mmrm to work with mira without the workaround, so we can use emmeans to test specific contrasts, take advantage of the full variance-covariance matrix, and employ the parametric mvt adjustment for p-values and CIs.
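
Just to illustrate the kind of downstream use I mean, a sketch on the complete-case fit m_mmrm from above (the contrast choice is arbitrary; with only two hyp levels there is a single contrast, so the adjustment is trivial here, but it matters once several contrasts are tested jointly):

library(emmeans)

emm <- emmeans(m_mmrm, specs = ~ hyp)
# multiplicity-adjusted tests and CIs via the multivariate-t ("mvt") method,
# which uses the full variance-covariance matrix of the estimated marginal means
summary(contrast(emm, method = "pairwise"), infer = TRUE, adjust = "mvt")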


PS: I'm not sure whether the DoF for glance() and tidy() should be the number of estimates and the number of observations minus the number of parameters (residual), especially since that doesn't account for the extra covariance parameters... For example, car::Anova DOES take the number of covariance parameters into consideration (I discussed this with the author). There is so much inconsistency across R packages here that I'm quite confused now.

Maybe you could propose a more consistent approach to which DoF should be reported for the purpose of pooling?

Generalized commented 1 year ago

PS: I found a way to combine mice, mmrm and emmeans. Maybe you could write a short vignette about it?

As I mentioned previously, mmrm doesn't work with with().

> imp_m_mmrm <- with(nhanes_miss, mmrm(bmi ~ hyp + chl + us(tim|id)))
Error in attr(data, "dataname") <- toString(match.call()$data) : 
  cannot set attribute on a symbol

So the workaround is to make a list of the completed datasets and iterate through them with lapply(). This, however, gives a plain list rather than a "mira" object, but it can be cast into a mira object with as.mira().

> dat_mice <- complete(nhanes_miss, "all")

> fit_mmrm <- function(dat) {
  mod <- mmrm(bmi ~ hyp + chl + us(tim|id),data=dat)
  return(mod)
}

> imp_m_mmrm <- lapply(dat_mice, fit_mmrm)
> summary(pool(imp_m_mmrm))
         term   estimate std.error  statistic        df     p.value
1 (Intercept) 19.8365753 5.1682953  3.8381273 13.230279 0.001992168
2        hyp2 -0.3542480 2.9824287 -0.1187784  9.270615 0.907985019
3         chl  0.0320654 0.0272654  1.1760471 14.651663 0.258331811

> emmeans(as.mira(imp_m_mmrm), specs = ~hyp+chl)
 hyp chl emmean   SE df lower.CL upper.CL
 1   187   25.8 1.98 22     21.7     29.9
 2   187   25.5 2.35 22     20.6     30.3

Confidence level used: 0.95 

and now I can also test the contrasts:

# For the original mmrm using only *complete cases* (13 - 3 parameters -> df =10)
> emmeans::contrast(emmeans(m_mmrm, specs = ~hyp+chl), list(some_contrast = c(1, -1)))
 contrast      estimate  SE df t.ratio p.value
 some_contrast    0.685 3.4 10   0.202  0.8443

# For the imputed datasets (25 - 3 -> df=22)
> emmeans::contrast(emmeans(as.mira(imp_m_mmrm), specs = ~hyp+chl), list(some_contrast = c(1, -1)))
 contrast      estimate   SE df t.ratio p.value
 some_contrast    0.354 2.98 22   0.119  0.9065

As I mentioned, it would be nice to have mmrm work with with() out of the box, but this approach bypasses that temporary limitation :)

danielinteractive commented 1 year ago

Thanks @Generalized! So we are now fixing the with() problem in #252. Do you then still need the workaround, or will it work out of the box?

Generalized commented 1 year ago

Thank you! The with() issue was problematic at the level of running the analysis repeatedly on each imputed dataset. The need for glance() and tidy() (and maybe other methods) defined for mmrm is at the level of pooling the results. It would be really helpful if mmrm cooperated with broom::tidy() and glance().

/ Justification: let's not forget that, although MMRM can handle MAR data itself and in theory doesn't require any imputation, in more complex trials with a multi-component primary endpoint there is often a need for multi-variable imputation in a chain. Some variables are used for deriving the primary compound endpoint, some are analyzed independently, some are used to impute others, and some are imputed manually or passively. MICE allows for all of this; it goes well beyond simple longitudinal modelling. As a consequence, we end up with a common (e.g. mITT) complete dataset of imputed variables, used to derive the compound endpoint, used for a variety of related analyses (so we want consistency between them), and used for various sensitivity analyses. This is a great simplification. All the more so as doing MMRM on such a complete dataset not only preserves the sample size but also provides consistency with the other analyses (and simple tests). While in classic RCTs the primary question is often just a simple between-arm comparison (via lm() or even a t-test) at a given timepoint, there are also more advanced secondary comparisons, including analyses of within-arm trends or XYZ-by-time interactions, where we prefer to analyze a single TRT x TIME x (covariates) interaction via LS-means. Therefore the combination of MI + MMRM is very flexible and definitely worthwhile. /

danielinteractive commented 1 year ago

Great, thanks @Generalized, so we can focus on this issue here to add the two methods :-)

danielinteractive commented 1 year ago

being worked on in #325 and duplicate of #324