paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0
1.29k stars 187 forks source link

Issue with `index` feature #1497

Open wds15 opened 1 year ago

wds15 commented 1 year ago

The index feature appears to have some issues with its interactions with emmeans. Here is an example:

Update of reprex below on 2023-05-17:

##library(brms)
devtools::load_all("~/rwork/brms") ## this is the latest as of 20. April 2023
#> ℹ Loading brms
#> Loading required package: Rcpp
#> 
#> Loading 'brms' package (version 2.19.3). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> 
#> The following objects are masked from 'package:brms':
#> 
#>     all_vars, collapse, filter, lag, rename, slice
#> 
#> The following object is masked from 'package:testthat':
#> 
#>     matches
#> 
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> 
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(here)
library(emmeans)
#> Warning: package 'emmeans' was built under R version 4.1.1

# instruct brms to use cmdstanr as backend and cache all Stan binaries
options(brms.backend="cmdstanr", cmdstanr_write_stan_file_dir=here("brms-cache"))
# create cache directory if not yet available
dir.create(here("brms-cache"), FALSE)

set.seed(346356)
# make some baseline values missing
epi_base <- unique(epilepsy[c("patient", "Base")]) %>%
    mutate(patient_idx=as.integer(patient),
           patient=NULL,
           missing=rbinom(n(), 1, 0.3),
           Base_obs=if_else(missing==1, rep(NA_real_, n()), Base),
           ref_visit=TRUE)

epilepsy_obs <- epilepsy %>%
    mutate(patient_idx=as.integer(patient),
           patient=NULL) %>%
    left_join(select(epi_base, patient_idx, Base_obs)) %>%
    mutate(ref_visit=visit==1)
#> Joining, by = "patient_idx"

sum(epilepsy_obs$ref_visit) == max(epilepsy_obs$patient_idx)
#> [1] TRUE

model <- bf(count ~ Trt + mi(Base_obs, patient_idx) + visit,
            autocor = ~unstr(time=visit, gr=patient_idx)) +
    bf(Base_obs | mi() + subset(ref_visit) + index(patient_idx) ~ 1) +
    set_rescor(FALSE)

model_prior <- prior(normal(0, 5), class=Intercept, resp=count) +
    prior(normal(0, 1), class=b, resp=count) +
    prior(normal(0, 5), class=Intercept, resp=Baseobs)

fit <- brm(model, prior=model_prior, data = epilepsy_obs, seed=346465, refresh=0, cores=4)
#> Start sampling
#> Running MCMC with 4 parallel chains...
#> 
#> Chain 2 finished in 3.6 seconds.
#> Chain 3 finished in 3.6 seconds.
#> Chain 1 finished in 3.9 seconds.
#> Chain 4 finished in 3.9 seconds.
#> 
#> All 4 chains finished successfully.
#> Mean chain execution time: 3.7 seconds.
#> Total execution time: 3.9 seconds.

## PROBLEM: emmmeans does not work?!
emmeans(fit, ~ Trt | visit, resp="count", weights="proportional")
#> Error: Index of response 'Baseobs' contains duplicated values.
## error: Error: Index of response 'Baseobs' contains duplicated values.

## does work now with current develop version of brms as of 2023-05-16
## => what is being imputed for Base_obs? Does not work with 2.19.0
pe <- posterior_epred(fit, resp="count")
dim(pe)
#> [1] 4000  236

pp <- posterior_predict(fit, resp="count")
dim(pp)
#> [1] 4000  236
dim(epilepsy_obs)
#> [1] 236  11

## free extra matter: I think that this requirement must be guranteed
## for things to work correctly...maybe check this? This assumes that
## the order of things must be correct (so when subsetting the
## data-set, then the index must be consecutivley ordered...or is
## there some magic done?) Does brms do a more clever thing which
## treats the index column more like a key rather than an index?
epilepsy_obs_ref <- subset(epilepsy_obs, ref_visit)
epilepsy_obs_ref$patient_idx == 1:nrow(epilepsy_obs_ref)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

## this seems to work, but I am not sure what is being predicted here
## for the rows with a missing baseline observataion
pen <- posterior_epred(fit, resp="count", newdata=epilepsy_obs)
ppn <- posterior_predict(fit, resp="count", newdata=epilepsy_obs)

## here is what I would expect to happen... for each draw:
## get a prediction for the missing things:
is_missing <- is.na(epi_base$Base_obs)
ppn1_bo <- posterior_predict(fit, resp="Baseobs", newdata=epi_base, draw_ids=1)
ppn1_bo[is_missing]
#>  [1] -25.834752  64.303117 -38.948466  -2.964165 -47.528959  67.958672
#>  [7]  15.870559  26.695454   1.954314  23.716118  51.330793  23.544970
#> [13]  35.659669  24.222133  47.609580  24.615045   2.432574  67.394192
#> [19] -19.652742
## then impute the data set with this data
epi_base_imp <- epi_base
epi_base_imp$Base_obs[is_missing] <- ppn1_bo[is_missing]

epilepsy_obs_imp <- mutate(epilepsy_obs, Base_obs=NULL) %>%
    left_join(select(epi_base_imp, patient_idx, Base_obs))
#> Joining, by = "patient_idx"

## now use the same draw id and do a prediction for count using the
## respectivley imputed Base_obs
ppn1_count <- posterior_predict(fit, resp="count", newdata=epilepsy_obs_imp, draw_ids=1)
##ppn1_count

Created on 2023-05-17 with reprex v2.0.2

paul-buerkner commented 1 year ago

Thank you. Let's think about some good ("natural") behavior of index models for new data. Once we have that, I think all of the current problems will be resolved.

wds15 commented 1 year ago

I updated the reprex above to show what I would expect to happen (note that there are small changes to the data-sets at the top of the script). This is what would mimic the logic as it happens during the fitting process... but I guess that there could be other choices as well such that considering an option for what is happening (now or at a later stage) is likely sensible. Does it make sense to you what I wrote in more or less dummy coding (though it works).

wds15 commented 1 year ago

BTW, I think I have found a workaround for getting what I want, which is imputation with uncertainty: One can simply include the data for which one wants predictions in the data set being fitted, but declare the outcome to be NA (and allow missing in the main outcome of the model). Doing so then allows me to do a posterior_predict for these entries, since the sampler does the by-draw imputation and these are then being used by brms to do the prediction...right? This solution makes fitting slower and does not really scale, but it does work in small problems at least.

paul-buerkner commented 1 year ago

Yes, but then you have to feed back a lot of data sets (one per draw) into posterior_predict, right? Which creates a lot of computational overhead. I agree that this is currently the most viable (but still suboptimal solution) if I understand correctly what you mean.

wds15 commented 1 year ago

It's brute force, yes. It does work ok for the small data sets I am working on right now ok...but for anything bigger it's a pain, yes! It's cumbersome as you need to code up the data-sets with great care in terms of not only what you want to fit, but also what you want to predict (and, yes, the posterior_predict always has to go over the entire data-set).