paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/

Data-related post-processing for multiple imputation models via brm_multiple() #997

Open jmgirard opened 4 years ago

jmgirard commented 4 years ago

When I try to use bayes_R2() on a brmsfit_multiple object, I get a warning message that only the first imputed data set was used. I assume this means that R2 was calculated using only the first imputed data set. I asked about this on the mc-stan discourse, and it seems that R2 can be pooled from the combined posterior samples just like many other parameters. Would it be possible to add that functionality to the bayes_R2() method? I tried to look into doing it myself, but I struggled to understand the fit$fit S4 object.

library(brms)
library(mice)
# impute the example data, then fit the model once per imputed data set
imp <- mice(nhanes2)
fit <- brm_multiple(bmi ~ age + hyp + chl, data = imp, chains = 1)
bayes_R2(fit)
#>    Estimate Est.Error      Q2.5    Q97.5
#> R2 0.4452458 0.1071932 0.2068827 0.623473
#> Warning message:
#> Using only the first imputed data set. Please interpret the results with caution until a more principled approach has been implemented. 
jmgirard commented 4 years ago

Hmm... it seems that bayes_R2() uses two primary variables: ypred and y.

ypred is created by posterior_epred() and prepare_predictions(). It looks like prepare_predictions() does include posterior samples from all imputed datasets for the model parameters, but it calculates the predictions using x$data, which only includes the first imputed dataset. To pool this, I think we would need to align each sample with its original x data. Similarly, y is created by get_y(), which only uses the first imputed dataset.
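For reference, here is a minimal sketch of the draw-wise computation that bayes_R2() performs from those two pieces (variable names are illustrative); note that y again comes from fit$data, i.e., the first imputed dataset only.

ypred <- posterior_epred(fit)               # draws x observations matrix
y <- fit$data$bmi                           # outcome from the stored (first) dataset
var_ypred <- apply(ypred, 1, var)           # fitted-value variance per draw
var_e <- apply(sweep(ypred, 2, y), 1, var)  # residual variance per draw
posterior_summary(var_ypred / (var_ypred + var_e))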

A further complication is that it seems brmsfit_multiple objects only store the first imputed dataset in $data. Perhaps a workaround would be for the list of imputed dataframes (or mids object) to be given as an additional argument to a new bayes_R2.brmsfit_multiple() method.
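As a rough illustration of that workaround (bayes_R2_multiple() is a hypothetical helper, not part of brms), one could accept the mids object, pass each completed dataset in as newdata, and pool the draw-level R2 values:

# hypothetical helper: pool draw-level R2 across the completed datasets of a mids object
bayes_R2_multiple <- function(fit, mids, ...) {
  r2 <- sapply(seq_len(mids$m), function(i) {
    bayes_R2(fit, newdata = mice::complete(mids, i), summary = FALSE, ...)
  })
  posterior_summary(c(r2))  # flatten draws x imputations and summarize
}

bayes_R2_multiple(fit, imp)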

jmgirard commented 4 years ago

Ok, I think the easiest approach is just to do this from the output of brm_multiple() when combine = FALSE.

library(mice)
library(brms)

imp <- mice(nhanes2)

mlist <- brm_multiple(
  formula = bmi ~ age + hyp + chl, 
  data = imp, 
  seed = 2020,
  chains = 1, 
  combine = FALSE
)

pool_R2 <- function(mlist, probs = c(0.025, 0.975), robust = FALSE, ...) {
  # draw-level R2 from each imputed-data fit, column-bound into one matrix
  r2_post <- sapply(mlist, bayes_R2, summary = FALSE, ..., simplify = TRUE)
  # flatten and summarize the pooled draws across all imputations
  posterior_summary(c(r2_post), probs = probs, robust = robust)
}

pool_R2(mlist)
#>       Estimate Est.Error     Q2.5     Q97.5
#> [1,] 0.5406651 0.1293193 0.258285 0.7449243

Perhaps the output of brm_multiple(..., combine = FALSE) could be given a class like brms_mlist, and then a fancier version of my pool_R2() function could become the bayes_R2() method for that class.
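A minimal sketch of that idea, assuming placeholder names (as_brms_mlist and brms_mlist are invented for illustration, not brms API): tag the list with a class and register an S3 method for the bayes_R2 generic.

# give the combine = FALSE output a placeholder class
as_brms_mlist <- function(x) structure(x, class = c("brms_mlist", class(x)))

# hypothetical bayes_R2() method dispatching on that class
bayes_R2.brms_mlist <- function(object, probs = c(0.025, 0.975),
                                robust = FALSE, ...) {
  r2_post <- sapply(object, bayes_R2, summary = FALSE, ...)
  posterior_summary(c(r2_post), probs = probs, robust = robust)
}

bayes_R2(as_brms_mlist(mlist))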

paul-buerkner commented 4 years ago

I agree with the idea that we should add better post-processing support for brm_multiple models, but this shouldn't just encompass bayes_R2; it should cover all post-processing methods that make use of .$data. However, it is not yet clear to me what the best output should be once we start storing all data sets in the object. For example, should we just return a list of predictions and leave it to the user to merge them somehow, or should we do some automatic merging, and if so, what should that look like?

I have changed the title of this issue to reflect the broader scope.

jmgirard commented 4 years ago

I could see three general approaches:

1) Make the default output of brm_multiple() be a list of brmsfit objects from each dataset (as the current output with combine = FALSE). Then we could give that type of list a class and write methods for this class that would handle the pooling across datasets.

2) Make the default output of brm_multiple() be a new class that is similar to brmsfit but includes data per dataset and/or already pooled. Then we would need new methods for this class whenever the format changed.

3) Make the default output of brm_multiple() be a brmsfit object (as the current output with combine = TRUE) but with some extra information added (e.g., data per dataset and/or already pooled) and add some conditional logic to the existing methods to accommodate this extra information when it is present (see the sketch after this list).
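To make option 3 a bit more concrete, here is a rough sketch under invented assumptions (the imputed_data field and the wrapper name are hypothetical, not part of brms):

# hypothetical wrapper: branch on whether per-imputation datasets were stored
bayes_R2_pooled <- function(object, ...) {
  if (!is.null(object$imputed_data)) {
    r2 <- sapply(object$imputed_data, function(d) {
      bayes_R2(object, newdata = d, summary = FALSE, ...)
    })
    return(posterior_summary(c(r2)))  # pool draws across imputations
  }
  bayes_R2(object, ...)  # current single-dataset behavior
}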

al-obrien commented 4 years ago

I like the brainstorming around how best to incorporate post-processing with brm_multiple(). Not sure if this helps at all, but I thought I would share how I have been dabbling with other model comparison metrics (waic, loo) and MICE: would it also be possible to use the combine = TRUE output from brm_multiple() and analyze the full imputed dataset as newdata in the various metrics?

Statistically speaking, I am not 100% confident that this is a correct approach, but I thought I would share anyway. I tried using the complete set of posterior samples on a 'new' dataset containing all the imputations from MICE. My thinking was that this would ensure that the log-likelihoods used to calculate waic/loo came from both the complete set of posterior samples and the imputed data they were generated from. If I am completely off-base or if this approach would be less than palatable, please let me know!

Set up the MICE model and datasets

library(mice)
library(brms)

# Create the imputation object and the full (long-format) completed dataset
imp <- mice(nhanes2)
imp_c <- mice::complete(imp, 'long')

mlist3 <- brm_multiple(
  formula = bmi ~ age + hyp + chl,
  warmup = 1000,
  iter = 2500,
  data = imp,
  seed = 2020,
  chains = 2,
  combine = TRUE
)

Model object post-processing

# With just the first imputed data set (the one stored in the fit)
brms::loo(mlist3)
Computed from 15000 by 25 log-likelihood matrix

         Estimate  SE
elpd_loo    -72.0 4.0
p_loo         9.6 2.0
looic       144.0 8.1
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     17    68.0%   2193      
 (0.5, 0.7]   (ok)        6    24.0%   705       
   (0.7, 1]   (bad)       2     8.0%   421       
   (1, Inf)   (very bad)  0     0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
Warning messages:
1: Using only the first imputed data set. Please interpret the results with caution until a more principled approach has been implemented. 
2: Found 2 observations with a pareto_k > 0.7 in model 'mlist3'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. 

Model object post-processing with the complete dataset

# With the complete imputed dataset, passed in as newdata
brms::loo(mlist3, newdata = imp_c)
Computed from 15000 by 125 log-likelihood matrix

         Estimate   SE
elpd_loo   -369.2 13.3
p_loo        57.3  9.2
looic       738.4 26.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     89    71.2%   2193      
 (0.5, 0.7]   (ok)       30    24.0%   303       
   (0.7, 1]   (bad)       4     3.2%   111       
   (1, Inf)   (very bad)  2     1.6%   7         
See help('pareto-k-diagnostic') for details.
Warning message:
Found 6 observations with a pareto_k > 0.7 in model 'mlist3'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations.  
jmgirard commented 4 years ago

@paul-buerkner If you can provide some guidance, I'd be happy to take a stab at this in a PR.

paul-buerkner commented 4 years ago

I agree with the options you laid out earlier. However, I haven't yet had enough time to form a good opinion on which approach is best. Right now, I would prefer the structure to stay basically as is, that is, the object looking like a brmsfit object with just an additional child class brmsfit_multiple, but, in contrast to what is currently done, storing all data sets in a list. The main decision problem I see is how to change the output of the methods. The most convenient option would be to just return a list of outputs, but there could be better ones.
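As a minimal sketch of that last point (the data_list field and the helper name are hypothetical): with all data sets stored in the object, a data-dependent method could loop over them and return one result per imputed data set, leaving any merging to the user.

# hypothetical: one loo() result per stored imputed data set
loo_per_imputation <- function(object) {
  stopifnot(!is.null(object$data_list))
  lapply(object$data_list, function(d) loo(object, newdata = d))
}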