jmgirard opened this issue 4 years ago
Hmm... it seems that `bayes_R2()` uses two primary variables: `ypred` and `y`. `ypred` is created by `posterior_epred()` and `prepare_predictions()`. It looks like `prepare_predictions()` does include posterior samples from all imputed datasets for the model parameters, but it calculates the predictions using `x$data`, which only includes the first imputed dataset. To pool this, I think we would need to align each sample with its original `x` data. Similarly, `y` is created by `get_y()`, which only uses the first imputed dataset.

A further complication is that `brmsfit_multiple` objects seem to store only the first imputed dataset in `$data`. Perhaps a workaround would be for the list of imputed data frames (or the `mids` object) to be given as an additional argument to a new `bayes_R2.brmsfit_multiple()` method.
Ok, I think the easiest approach is just to do this from the output of `brm_multiple()` when `combine = FALSE`:
```r
library(mice)
library(brms)

imp <- mice(nhanes2)

mlist <- brm_multiple(
  formula = bmi ~ age + hyp + chl,
  data = imp,
  seed = 2020,
  chains = 1,
  combine = FALSE
)
```
```r
pool_R2 <- function(mlist, probs = c(0.025, 0.975), robust = FALSE, ...) {
  r2_post <- sapply(mlist, bayes_R2, summary = FALSE, ..., simplify = TRUE)
  posterior_summary(c(r2_post), probs = probs, robust = robust)
}

pool_R2(mlist)
#>       Estimate Est.Error     Q2.5     Q97.5
#> [1,] 0.5406651 0.1293193 0.258285 0.7449243
```
Perhaps the output of `brm_multiple(..., combine = FALSE)` could be given a class like `brms_mlist` or something like that, and then a fancier version of my `pool_R2()` function could become the `bayes_R2()` method for that class.
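To make the idea concrete, here is a rough sketch of what such a method could look like. The class name `brms_mlist`, the `as_brms_mlist()` helper, and the S3 wiring are all hypothetical; nothing like this exists in brms yet, and this assumes the input is the plain list of `brmsfit` objects returned by `brm_multiple(..., combine = FALSE)`.

```r
# Hypothetical sketch: an S3 method for a (not yet existing) brms_mlist class.
# Assumes `x` is a list of brmsfit objects, one per imputed dataset.
as_brms_mlist <- function(x) {
  class(x) <- c("brms_mlist", class(x))
  x
}

bayes_R2.brms_mlist <- function(object, probs = c(0.025, 0.975),
                                robust = FALSE, summary = TRUE, ...) {
  # Compute posterior draws of R2 separately within each imputed dataset,
  # then pool by stacking the draws so each fit contributes equally.
  r2_draws <- lapply(object, bayes_R2, summary = FALSE, ...)
  pooled <- do.call(rbind, r2_draws)
  if (!summary) return(pooled)
  posterior_summary(pooled, probs = probs, robust = robust)
}

# Usage (sketch): bayes_R2(as_brms_mlist(mlist))
```

Stacking the draws treats between-imputation variability as part of the posterior uncertainty, which is the same pooling logic as the `pool_R2()` function above.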
I agree with the idea that we should add better post-processing support for `brm_multiple` models, but this shouldn't just encompass `bayes_R2`; it should cover all post-processing methods that make use of `.$data`. However, it is not yet clear to me what the best output should be once we start storing all data sets in the object. For example, should we just return a list of predictions and leave it to the user to merge them somehow, or should we do some automatic merging, and if so, what should that look like?

I have changed the title of this issue to reflect the broader scope.
I could see three general approaches:

1. Make the default output of `brm_multiple()` be a list of `brmsfit` objects, one per dataset (i.e., the current output with `combine = FALSE`). Then we could give that type of list a class and write methods for this class that would handle the pooling across datasets.
2. Make the default output of `brm_multiple()` be a new class that is similar to `brmsfit` but includes data per dataset and/or already pooled. Then we would need new methods for this class whenever the format changed.
3. Make the default output of `brm_multiple()` be a `brmsfit` object (i.e., the current output with `combine = TRUE`) but with some extra information added (e.g., data per dataset and/or already pooled), and add some conditional logic to the existing methods to accommodate this extra information when it is present.
I like the brainstorming around how to best incorporate post-processing with `brm_multiple()`. Not sure if this helps at all, but I thought I would share how I was dabbling with other model comparison metrics (WAIC, LOO) and MICE: would it also be possible to use the `combine = TRUE` output from `brm_multiple()` and analyze the full imputed dataset as `newdata` in the various metrics?

Statistically speaking, I am not 100% confident that this is a correct approach, but I thought I would share anyway. I tried using the complete set of posterior samples on a 'new' dataset containing all the imputations from MICE. My thinking was that this would ensure the log-likelihoods used to calculate WAIC/LOO came from both the complete set of posterior samples and the imputed data they were generated from. If I am completely off-base or if this approach would be less than palatable, please let me know!
```r
library(mice)
library(brms)

# Create the imputation object and the full (long-format) set of completed data
imp <- mice(nhanes2)
imp_c <- mice::complete(imp, 'long')

mlist3 <- brm_multiple(
  formula = bmi ~ age + hyp + chl,
  warmup = 1000,
  iter = 2500,
  data = imp,
  seed = 2020,
  chains = 2,
  combine = TRUE
)
```
```r
# With just the first imputed dataset
brms::loo(mlist3)
#> Computed from 15000 by 25 log-likelihood matrix
#>
#>          Estimate  SE
#> elpd_loo    -72.0 4.0
#> p_looo        9.6 2.0
#> looic       144.0 8.1
#> ------
#> Monte Carlo SE of elpd_loo is NA.
#>
#> Pareto k diagnostic values:
#>                          Count Pct.  Min. n_eff
#> (-Inf, 0.5]   (good)     17    68.0% 2193
#> (0.5, 0.7]    (ok)        6    24.0% 705
#> (0.7, 1]      (bad)       2     8.0% 421
#> (1, Inf)      (very bad)  0     0.0% <NA>
#>
#> See help('pareto-k-diagnostic') for details.
#> Warning messages:
#> 1: Using only the first imputed data set. Please interpret the results with
#>    caution until a more principled approach has been implemented.
#> 2: Found 2 observations with a pareto_k > 0.7 in model 'mlist3'. It is
#>    recommended to set 'moment_match = TRUE' in order to perform moment
#>    matching for problematic observations.
```
```r
# With the complete imputed dataset passed in as newdata
brms::loo(mlist3, newdata = imp_c)
#> Computed from 15000 by 125 log-likelihood matrix
#>
#>          Estimate   SE
#> elpd_loo   -369.2 13.3
#> p_loo        57.3  9.2
#> looic       738.4 26.7
#> ------
#> Monte Carlo SE of elpd_loo is NA.
#>
#> Pareto k diagnostic values:
#>                          Count Pct.  Min. n_eff
#> (-Inf, 0.5]   (good)     89    71.2% 2193
#> (0.5, 0.7]    (ok)       30    24.0% 303
#> (0.7, 1]      (bad)       4     3.2% 111
#> (1, Inf)      (very bad)  2     1.6% 7
#>
#> See help('pareto-k-diagnostic') for details.
#> Warning message:
#> Found 6 observations with a pareto_k > 0.7 in model 'mlist3'. It is
#> recommended to set 'moment_match = TRUE' in order to perform moment matching
#> for problematic observations.
```
@paul-buerkner If you can provide some guidance, I'd be happy to take a stab at this in a PR.
I agree with the options you laid out earlier. However, I haven't had enough time yet to form a good opinion on which approach is most beneficial. Right now, I would prefer the structure staying basically as is, that is, the object looking like a `brmsfit` object just with an additional child class `brmsfit_multiple`, and, changing what is currently done, storing all data sets in a list. The main decision problem I see is how to change the output of the methods. Most convenient would be to just return a list of outputs, but there could be better options.
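As a sketch of the "list of outputs" idea: assuming a `brmsfit_multiple` object gained a (currently nonexistent) list of all imputed datasets, a small helper could re-run any post-processing method once per dataset via `newdata` and return the results as a list, leaving pooling to the user. The `data_list` element name is an assumption, not an existing brms field.

```r
# Sketch only: loops a post-processing method over all imputed datasets and
# returns a list of outputs, one per dataset. Assumes `fit` is a
# brmsfit_multiple object that (hypothetically) stores the imputed datasets
# in `fit$data_list`.
apply_per_dataset <- function(fit, method, ...) {
  stopifnot(!is.null(fit$data_list))
  lapply(fit$data_list, function(d) method(fit, newdata = d, ...))
}

# Usage (sketch): per-dataset LOO, pooling left to the user
# loo_list <- apply_per_dataset(fit, brms::loo)
```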
When I try to use `bayes_R2()` on a `brmsfit_multiple` object, I get a warning message that only the first imputed data set was used. I assume this means that R2 was only calculated using the first imputed dataset. I asked about this on the mc-stan discourse, and it seems that R2 can be pooled from the combined posterior samples just like many other parameters. Would it be possible to add that functionality to the `bayes_R2()` method? I tried to look into doing it myself, but was struggling to understand the `fit$fit` S4 object.