paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0

support joint log score in kfold #1607

Closed avehtari closed 6 months ago

avehtari commented 6 months ago

Currently, kfold() computes the pointwise log score (elpd_loo). For leave-one-group-out cross-validation, it would be useful to have an option to compute the joint log score for each group. Instead of returning a pointwise elpd for each observation, this would return a groupwise joint elpd (eljpd) for each group. Given those, the rest of the summary and model comparison functions should work as they do now, although some docs and messages could be refined.
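To make the distinction concrete, the two quantities can be written out as below. This is only a sketch in assumed notation (the symbols k(i), k(g), S, and θ^(s) do not appear in the thread), and the product form additionally assumes that observations are conditionally independent given the parameters.

```latex
% k(i): fold containing observation i; \theta^{(s)}, s = 1..S: posterior draws
% from the model fitted without that fold (notation assumed for illustration).
\mathrm{elpd}_i = \log p(y_i \mid y_{-k(i)})
  \approx \log \frac{1}{S} \sum_{s=1}^{S} p\bigl(y_i \mid \theta^{(s)}\bigr)

% y_g: all observations of group g, held out together in fold k(g).
\mathrm{eljpd}_g = \log p(y_g \mid y_{-k(g)})
  \approx \log \frac{1}{S} \sum_{s=1}^{S} \prod_{i \in g} p\bigl(y_i \mid \theta^{(s)}\bigr)
```

The pointwise version averages over draws separately for each observation, while the joint version multiplies the likelihoods within a group first and only then averages over draws.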

paul-buerkner commented 6 months ago

Good idea! Will add that to the next release.

paul-buerkner commented 6 months ago

Actually, I think brms supports this already via the argument joint that I recently added following our LGO-CV discussion. Can you check if that is what you mean?

avehtari commented 6 months ago

Ah, missed that. I checked the web page doc and issues, but not the doc for the github version. Does that work if K is less than the number of groups?

paul-buerkner commented 6 months ago

Ah, you want to have the joint by group, not (necessarily) by fold (K)?

avehtari commented 6 months ago

Yes, because with a large number of groups, the computation can take a long time.

wds15 commented 6 months ago

If I get this right, this would allow leave-one-patient-out CV, which would be super useful.

Does this need extra considerations if the different groups have different numbers of observations? In that case one would want to account for that, right?

Also, the way stratification is done with respect to other covariates may need to be reconsidered if one has a grouping factor with unequal group sizes. In many cases the covariates would only vary by group, as these are mostly baseline covariates (in the context of a randomised clinical trial).

avehtari commented 6 months ago

> If I get this right, this would allow leave-one-patient-out CV

Yes. I was testing with the Nabiximols case study, and I ran out of memory with save_fits=TRUE with 2 models having 105 folds each, but I'd like to make predictions, too.

> Does this need extra considerations if the different groups have different numbers of observations? In that case one would want to account for that, right?

That should be an option.

For example, the following

kfold2b <- kfold(fit_betabinomial2b, group="id", folds=kfold_split_grouped(20, droplevels(cu_df_b$id)))

or maybe even

kfold2b <- kfold(fit_betabinomial2b, group="id", K=20)

would do 20-fold CV but would return 105 joint elpds. The default would be to return the simple joint, so that if there are no predictive dependencies, the sum of the joint elpds is the same as the sum of the pointwise elpds.
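The relation between joint and pointwise elpds can be illustrated with a small base-R sketch. It is not how brms or loo implement this: the helper names (`log_mean_exp`, `pointwise_elpd`, `joint_elpd`) are hypothetical, and `log_lik` is assumed to be a draws-by-observations matrix of held-out log-likelihoods, simulated here rather than taken from a real fit.

```r
# Numerically stable log(mean(exp(x)))
log_mean_exp <- function(x) {
  m <- max(x)
  m + log(mean(exp(x - m)))
}

# Pointwise elpd: one value per observation (per column of log_lik)
pointwise_elpd <- function(log_lik) {
  apply(log_lik, 2, log_mean_exp)
}

# Joint elpd: one value per group. Log-likelihoods are summed within
# each group for every draw, and only then averaged over draws.
# Assumes more than one draw (row) in log_lik.
joint_elpd <- function(log_lik, group) {
  group <- factor(group)
  joint_ll <- sapply(levels(group), function(g) {
    rowSums(log_lik[, group == g, drop = FALSE])
  })
  apply(joint_ll, 2, log_mean_exp)
}

S <- 200                                # number of posterior draws (assumed)
group <- c("a", "a", "a", "b", "b")     # hypothetical grouping of 5 observations

# Case 1: log-likelihoods vary across draws (simulated values)
set.seed(1)
log_lik <- matrix(rnorm(S * 5, mean = -1, sd = 0.5), nrow = S)
pw <- pointwise_elpd(log_lik)           # 5 pointwise values
jt <- joint_elpd(log_lik, group)        # 2 groupwise joint values

# Case 2: log-likelihoods identical across draws, so there is no
# dependence between observations induced by the draws
log_lik_const <- matrix(rep(c(-1, -2, -0.5, -1.5, -3), each = S), nrow = S)
pw_const <- pointwise_elpd(log_lik_const)
jt_const <- joint_elpd(log_lik_const, group)
```

In the second case the sum of the joint values equals the sum of the pointwise values, which is the situation described above; in the first case the two sums will generally differ.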

paul-buerkner commented 6 months ago

Now works on GitHub. Below is an example:

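library(brms)

# Poisson multilevel model with a varying intercept per patient
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
            data = epilepsy, family = poisson())

# Folds are built from the patient grouping; joint log score per patient
kfold(fit1, folds = "group", joint = "group", group = "patient")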