JomoImpute taking a long time with mitml

emlowth commented 5 months ago

Hi there,

I have been following mitml, as it seems to be the only package I could use. I'm trying to impute a glmmTMB binomial model which has 200,000 observations over 17 time-points (wave), adjusted for person_id, wave, and school_id nested within the local governance area. It is a complex model, but the typical model without imputation takes around 20 minutes to run with glmmTMB (standard glmer does not run and has convergence issues). We do not have much missing data: around 5% for most covariates, and 17 - 20% for a couple of others. We have no missing data on the two-level variables. We do have missing data for the outcome variable from 10%. The data is also unbalanced, with half of the sample only having two waves of data.

I've been using jomoImpute as it was made for categorical and continuous data, but I am having no luck after following the information at https://cran.r-project.org/web/packages/mitml/vignettes/Introduction.html - which was very helpful and straightforward.

Here is my code so far:

fml <- attainment + disability + attendance_std + fsm + gender + ethnicity + deprivation + health + birthweight + gestational_age + multiple_birth + anomalies + breastfeeding + month_of_birth + birth_cohort ~ 1 + (1 | wave) + (1 | person_id) + (1 | governance/school_id)

imp <- jomoImpute(data, formula = fml, n.burn = 5000, m=10)

I've waited 3 days and nothing has completed; it did not even get to counting the burn-in stage. I tried using panImpute, and it did run, but the convergence was poor, and that's likely due to it being only for linear models.

Any helpful advice you have would be appreciated.

Emily

simongrund1 commented 4 months ago

This seems like a very large data set, so that might explain some of the issues you're seeing. There are also some aspects of the model specification that might add to this. Specifically, the right-hand side includes multiple levels of clustering (wave, person, schools), but jomo and mitml only support one level of clustering.

Depending on the design, you could try simplifying the model, and some parts might also be superfluous. For example, if this is a longitudinal study with nested data--measurement occasions (waves) nested in persons, persons nested in schools--then you should probably remove the (1|wave) random effect, unless there are somehow multiple observations nested within waves. The remaining levels of clustering (persons, schools/areas) would need to be simplified, for example, by (a) removing levels or (b) coding them as factor variables and including them as fixed effects (like so: ~ 1 + school_factor + area_factor + (1|person_id)). Neither of these options is ideal, and how well they work depends on what analyses you want to run.
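As a rough illustration of option (b), here is a minimal sketch on simulated toy data, not the real data set. The variable names attainment, attendance_std, person_id and school_id are borrowed from the formula in the question; the toy data frame, the recoded school_factor column, and the very short chain settings (n.burn, n.iter, m) are assumptions chosen only so the example finishes quickly.

```r
library(mitml)

set.seed(1)

# Toy longitudinal data: 40 persons with 3 waves each, persons nested in 4 schools.
n_person <- 40
n_wave <- 3
toy <- data.frame(
  person_id = rep(seq_len(n_person), each = n_wave),
  school_id = rep(sample(1:4, n_person, replace = TRUE), each = n_wave)
)
toy$attendance_std <- rnorm(nrow(toy))
toy$attainment <- factor(rbinom(nrow(toy), 1, 0.5))  # categorical targets must be factors

# Introduce some missing values in the variables to be imputed.
toy$attainment[sample(nrow(toy), 12)] <- NA
toy$attendance_std[sample(nrow(toy), 6)] <- NA

# Option (b): code the higher level as a factor and use it as a fixed effect,
# keeping a single level of clustering (person_id).
toy$school_factor <- factor(toy$school_id)
fml <- attainment + attendance_std ~ 1 + school_factor + (1 | person_id)

# Deliberately short chains for illustration; real applications need far more
# iterations and a check of the convergence diagnostics.
imp <- jomoImpute(toy, formula = fml, n.burn = 100, n.iter = 10, m = 2, silent = TRUE)

# Extract the completed data sets as a list.
implist <- mitmlComplete(imp, print = "all")
```

With many schools, the factor adds one dummy variable per school to the imputation model, so this approach is only practical when the number of higher-level units is moderate.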

For three-level models, there are also other software options that actually support higher levels of clustering. For example, the packages mice and miceadds support any number of levels; Mplus and Blimp support three levels. The imputations generated by these packages can still be converted to the mitml format afterwards for analysis.
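The conversion step might look roughly like the sketch below. It assumes the imputations from the other software are available as a plain list of completed data frames; the two toy data frames here are stand-ins and not output from any of the packages named above, and the lm() call is only a placeholder analysis. For imputations created with mice, mitml also provides mids2mitml.list(), which takes the mids object directly.

```r
library(mitml)

set.seed(2)

# Stand-in for imputations produced elsewhere: two completed copies of a
# small toy data set (hypothetical values, for illustration only).
base <- data.frame(person_id = rep(1:10, each = 2), y = rnorm(20))
completed <- list(base, transform(base, y = y + rnorm(20, sd = 0.1)))

# Convert the list of data frames to the mitml.list format.
implist <- as.mitml.list(completed)

# Fit the same (placeholder) model to each data set and pool the results.
fit <- with(implist, lm(y ~ 1))
testEstimates(fit)
```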

emlowth commented 4 months ago

Hi Simon,

Thank you for getting back to me. I think I assumed that more complex clustering would be accepted - but that is my problem if I didn't identify this well in the information.

Thanks for the other tips - this is useful. I agree that including wave is a difficult decision - I have had varying advice from different statisticians on its inclusion. It may well be overstretching the model in places, as some people only have 2 waves of data and some have 4.

I'll take a look at mice and miceadds - I will definitely be coming back to mitml for some analysis I'm doing with only one level as the information provided is very helpful.

simongrund1 commented 4 months ago

Treating this as solved, but let me know if it should be re-opened.

simongrund1 / mitml

JomoImpute taking a long time with mitml #24