mayer79 / missRanger

Fast multivariate imputation by random forests.
https://mayer79.github.io/missRanger/
GNU General Public License v2.0

Question on missRanger and BRMS #30

Closed GabriellaS-K closed 3 years ago

GabriellaS-K commented 3 years ago

Hi,

Thank you for a brilliant package. I'm using missRanger to impute, and then fitting brms models to the imputed dataset. The brms documentation describes how to use the mice package, but the imputed data from missRanger comes out in quite a different form.

Ideally I would impute the data, pool it, run my models, and then run model comparisons. But I cannot pool using mice; it doesn't work. So instead I run multiple models on the imputed data like this:

models_imputed <- brm_multiple(formula = score ~ 1 + cs(group), data = imputed, family = acat("cloglog"), combine = TRUE, chains = 1)

But this is pretty clunky, and if I try to run LOO on my models (I have 5), I get this message: Using only the first imputed data set. Please interpret the results with caution until a more principled approach has been implemented.

This isn't an issue with missRanger as such, more that I'm caught in the space between missRanger and BRMS and am not sure how to get them to work together...hoping someone might have advice!

Thanks

mayer79 commented 3 years ago

I think brm_multiple just expects a list of datasets, so you can basically go along the lines of the missRanger multiple imputation vignette on https://cran.r-project.org/web/packages/missRanger/vignettes/multiple_imputation.html

Let me know if the results look (un-)reasonable.

# Via mice
library(mice)
library(brms)

imp <- mice(nhanes, m = 5, print = FALSE)

fit_imp1 <- brm_multiple(bmi ~ age*chl, data = imp, chains = 2)

# With missRanger
library(missRanger)

# Generate 5 complete data sets
imp <- replicate(5, missRanger(nhanes, verbose = 0, num.trees = 50, pmm.k = 5),
                 simplify = FALSE)

# Fit model
fit_imp2 <- brm_multiple(bmi ~ age*chl, data = imp, chains = 2)

GabriellaS-K commented 3 years ago

Hi,

Thank you so much for the answer. That's actually what I tried to do: my imputed dataset (called imputed) was fed straight into brm_multiple, just like in your example with fit_imp2. The model runs; the problem comes afterwards. I'd like to compare different models using the loo function, but because the result isn't pooled, it only uses the first imputed dataset.

mayer79 commented 3 years ago

Hmm. If you could adapt my examples (both mice and missRanger) accordingly, that would be fantastic.

GabriellaS-K commented 3 years ago

I'm not sure what you mean by adapt your examples, sorry!!

mayer79 commented 3 years ago

I would need a fully reproducible example to see what works and what does not.

GabriellaS-K commented 3 years ago

Ah ok, great!

Please find below:

Here is a subset of my data:

 structure(list(agequartiles = structure(c(1L, 3L, 2L, 1L, 2L, 
4L, 3L, 1L, 3L, 4L, 1L, 2L, 2L, 2L, 4L, 1L, 3L, 3L, 4L, 4L, 4L, 
3L, 4L, 1L, 4L, 3L, 1L, 4L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 3L, 2L, 
2L, 3L, 4L, 4L, 3L, 2L, 3L, NA, 1L, 1L, 1L, 2L, 2L), .Label = c("[18,23]", 
"(23,27]", "(27,32]", "(32,54]"), class = "factor"), sentiment = c(1, 
1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 
1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 2, 1, 
1, 2, 1, 1, 3, 1, 3), group = structure(c(2L, 3L, 3L, 2L, 2L, 
1L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 3L, 2L, 2L, 1L, 3L, 
1L, 3L, 2L, 1L, 2L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 2L, 3L, 
3L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 2L), .Label = c("prime1", 
"prime2", "prime3"), class = "factor"), continent = c("UK", "Australia and New Zealand", 
"Northern America", "UK", "Northern America", "Australia and New Zealand", 
"Asia and the Pacific", "UK", "Southern and Central America", 
"Australia and New Zealand", "UK", "Northern America", "Northern America", 
"UK", "Northern America", "UK", "UK", "Northern America", "UK", 
"Northern America", "Northern America", "Southern and Central America", 
"Northern America", "UK", "Europe", "Northern America", "UK", 
"Northern America", NA, "UK", "UK", "Australia and New Zealand", 
"Australia and New Zealand", "UK", "UK", "UK", "Australia and New Zealand", 
"Northern America", "UK", "Northern America", "UK", "Asia and the Pacific", 
"Northern America", "Northern America", NA, NA, "UK", "Europe", 
"UK", "Northern America"), ID = 1:50, medication = c("FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", 
"FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE", 
"FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "TRUE")), row.names = c(NA, 
50L), class = "data.frame")

Then I imputed:


library(missRanger)

# Assumes the data frame shown above has been assigned to `data`
data <- lapply(3456:3460, function(x)
  missRanger(
    data,
    . ~ . - ID,        # impute all columns, using all columns except ID as predictors
    maxiter = 10,      # maximum number of iterations
    pmm.k = 3,         # predictive mean matching, for more natural imputations
    verbose = 1,       # how much info is printed to the screen
    seed = x,          # integer seed to initialize the random generator
    num.trees = 200,
    returnOOB = TRUE,
    case.weights = NULL
  )
)
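
Since brm_multiple expects a list of data sets, it can help to confirm that every element of the list is a complete data frame before fitting. A minimal sketch in base R; check_imputations is a hypothetical helper, not part of missRanger or brms:

```r
# Hypothetical helper (not part of missRanger or brms): summarise whether each
# element of `imps` is a data frame, has no missing values, and shares the
# column names of the first element.
check_imputations <- function(imps) {
  stopifnot(is.list(imps), length(imps) > 0)
  is_df <- vapply(imps, is.data.frame, logical(1))
  complete <- vapply(imps, function(d) !any(is.na(d)), logical(1))
  same_cols <- vapply(
    imps,
    function(d) identical(names(d), names(imps[[1]])),
    logical(1)
  )
  data.frame(imputation = seq_along(imps), is_df, complete, same_cols)
}
```

Calling check_imputations(data) on the list produced above returns one row per imputed data set; complete should be TRUE in every row if the imputation filled all gaps.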

Then I ran 5 models

library(brms)

models_group <- brm_multiple(formula = sentiment ~ 1 + cs(group), data = data, family = acat("cloglog"), combine = TRUE, chains = 4)

models_meds <- brm_multiple(formula = sentiment ~ 1 + cs(group) + medication, data = data, family = acat("cloglog"), combine = TRUE, chains = 4)

# agequartiles is the age column in the data subset above
models_age <- brm_multiple(formula = sentiment ~ 1 + cs(group) + agequartiles, data = data, family = acat("cloglog"), combine = TRUE, chains = 4)

models_continent <- brm_multiple(formula = sentiment ~ 1 + cs(group) + continent, data = data, family = acat("cloglog"), combine = TRUE, chains = 4)

models_all <- brm_multiple(formula = sentiment ~ 1 + cs(group) + agequartiles + medication + continent, data = data, family = acat("cloglog"), combine = TRUE, chains = 4)

And finally the LOO

modelcomparison <- loo(models_all, models_group, models_meds, models_continent, models_age)

mayer79 commented 3 years ago

Okay, thanks a lot for that example. I visited

My first thought:

  1. use combine = FALSE in brm_multiple(), then
  2. pool result of brm_multiple() doing some Bayesian magic, then
  3. run loo

I would actually suggest asking the brms team how they would approach the problem. I think it would be quite cool if loo worked on the output of brm_multiple(), regardless of whether missRanger or another algorithm did the imputation.
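
Steps 1 and 3 could be sketched as follows, leaving the pooling in step 2 open. This is only an assumption about how one might proceed, not a method endorsed by brms: compare_per_imputation is a hypothetical helper that runs loo on each per-imputation fit and compares two models within each imputed data set, so the result is one comparison per data set rather than a pooled one.

```r
# Hypothetical helper (not part of brms or missRanger).
# fits_a, fits_b: lists of fitted models of equal length, e.g. the output of
#   brm_multiple(..., combine = FALSE), one brmsfit per imputed data set.
# loo_fun, compare_fun: default to brms::loo and brms::loo_compare; the
#   defaults are only evaluated if no replacement is supplied.
compare_per_imputation <- function(fits_a, fits_b,
                                   loo_fun = brms::loo,
                                   compare_fun = brms::loo_compare) {
  stopifnot(length(fits_a) == length(fits_b))
  # One comparison per imputed data set, in the order of the input lists
  Map(function(a, b) compare_fun(loo_fun(a), loo_fun(b)), fits_a, fits_b)
}

# Usage sketch (slow; needs brms and a working Stan toolchain):
# fits_group <- brms::brm_multiple(sentiment ~ 1 + cs(group), data = data,
#                                  family = brms::acat("cloglog"),
#                                  combine = FALSE, chains = 4)
# fits_meds <- brms::brm_multiple(sentiment ~ 1 + cs(group) + medication,
#                                 data = data, family = brms::acat("cloglog"),
#                                 combine = FALSE, chains = 4)
# comparisons <- compare_per_imputation(fits_group, fits_meds)
```

If the ranking of the models is the same in every element of comparisons, the choice of imputed data set probably matters little for the comparison; how to combine the results into a single number is exactly the open question to raise with the brms team.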

GabriellaS-K commented 3 years ago

OK great thank you for that, I will do!