stan-dev / projpred

Projection predictive variable selection
https://mc-stan.org/projpred/
Other
110 stars 26 forks source link

Error when getting reference model with k-fold cross validation for cv_varsel #485

Closed adietzel closed 7 months ago

adietzel commented 7 months ago

I am revisiting a snippet of code in my analysis that used to work with previous versions of brms and projpred (Projpred Issue #328, and on a different machine after making switch from Windows to Mac.

My work flow with reprex below is: 1) Create custom stratiefied cross-validation folds with loo::kfold_split_stratified(), 2) Refit the model with CV using brms::kfold() 3) Create a custom reference model with the cvfits 4) Select relevant variables with cv_varsel()

The syntax used here for specifying a custom k-fold cv ref model is also shown in #1286

library(brms) # version 2.20.4
library(projpred) # version 2.7.0

myK <- 2

data <- data.frame(
  Pres = sample(c(0, 1), 100, replace = TRUE),
  X1 = rnorm(100, 0, 1),
  X2 = rnorm(100, 0.2, 1))

RefMod <- brm(
  formula = Pres ~ X1 + X2,
  family = bernoulli(),
  data = data)

ids <- loo::kfold_split_stratified(K = myK, x = data$Pres)
myfits_K <- brms::kfold(RefMod, folds = ids, K = myK, save_fits = TRUE)

RefMod_CV <- brms:::get_refmodel.brmsfit(
  object = RefMod,
  cvfits = structure(list("fits" = myfits_K$fits[, "fit"]),
                     "K" = myK, "folds" = ids))

mycvvs <- cv_varsel(
  RefMod_CV,
  method = "forward",
  cv_method = "kfold",
  K = myK,
  cvfits = RefMod_CV$cvfits,
  seed = 86132035)

When running cv_varsel() I get the following error message:

"in family$linkinv(pred_sub) : Argument eta must be a nonempty numeric vector In addition: Warning message: In parse_args_cv_varsel(refmodel = refmodel, cv_method = cv_method, : The content of cvfits's sub-list called fits should be moved one level up (and element fits removed). The old structure will continue to work for a while, but is deprecated."

I would appreciate any help with understanding the main error message and also with how to set the ref model correctly now.

Thanks for your help!

Andreas

Also posted as issue on brms github page #1572

fweber144 commented 7 months ago

Yes, this is a known issue with the deprecated structure of cvfits that was fixed by commit b5bdc3175b285638b935c003d64cc991d140f82e. I plan to release a new CRAN version this week and this fix will be included there. Despite this, as mentioned in the warning message, you are using the deprecated structure of cvfits. See the docs for argument cvfits at https://mc-stan.org/projpred/reference/refmodel-init-get.html for the new structure.

fweber144 commented 7 months ago

Closing now since this was fixed by commit b5bdc3175b285638b935c003d64cc991d140f82e.

adietzel commented 7 months ago

Thanks a lot!

Changing the code to

RefMod_CV <- brms:::get_refmodel.brmsfit(
  object = RefMod,
  cvfits = structure(myfits_K$fits,
                     folds = attr(myfits_K, "folds")))

as suggested by the commit fixed the issue. A ll terms are selected but I do get another error message that I can't make sense of:

Running the search using the full dataset ... 50% of terms selected 100% of terms selected

Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "c('integer', 'numeric')"

Is there another issue with my code or will this issue be resolved by the next CRAN release? Would appreciate your thoughts.

Thanks a lot for your help

fweber144 commented 7 months ago

I would need a reprex for this

adietzel commented 7 months ago

Here is the one from above with the changes to the cvfits structure:

library(brms) # version 2.20.4
library(projpred) # version 2.7.0

myK <- 2

data <- data.frame(
  Pres = sample(c(0, 1), 100, replace = TRUE),
  X1 = rnorm(100, 0, 1),
  X2 = rnorm(100, 0.2, 1))

RefMod <- brm(
  formula = Pres ~ X1 + X2,
  family = bernoulli(),
  data = data)

ids <- loo::kfold_split_stratified(K = myK, x = data$Pres)
myfits_K <- brms::kfold(RefMod, folds = ids, K = myK, save_fits = TRUE)

RefMod_CV <- brms:::get_refmodel.brmsfit(
  object = RefMod,
  cvfits = structure(myfits_K$fits,
                     folds = attr(myfits_K, "folds")))

mycvvs <- cv_varsel(
  RefMod_CV,
  method = "forward",
  cv_method = "kfold",
  K = myK,
  cvfits = RefMod_CV$cvfits,
  seed = 86132035)
fweber144 commented 7 months ago

You do not specify cvfits as required. Either use the new structure (see https://github.com/stan-dev/projpred/issues/485#issuecomment-1852238838) or use the deprecated structure (which will not be guaranteed to work forever), but then you need the fixed projpred version.

fweber144 commented 7 months ago

In any case, you may also use run_cvfun() added in projpred 2.7.0 (see the corresponding NEWS entry).

adietzel commented 7 months ago

Thanks for pointing me in the right direction!

Code below now works for those interested. Following Frank's advice I created the cvfits object with the function run_cvfun().

To provide an alternative approach based on my original code, I also had another look at the structure required for the cvfits object. Since I am working with custom stratified cross validation folds, the second approach is more flexible than using run_cvfun(). Though I am not sure whether custom CV folds can also be created with run_cvfun() or alternative functions.

library(brms) # version 2.20.4
library(projpred) # version 2.7.0

myK <- 2

data <- data.frame(
  Pres = sample(c(0, 1), 100, replace = TRUE),
  X1 = rnorm(100, 0, 1),
  X2 = rnorm(100, 0.2, 1))

RefMod <- brm(
  formula = Pres ~ X1 + X2,
  family = bernoulli(),
  data = data)

# Either or but first won't work with custom folds to my knowledge
cvfits <- run_cvfun(RefMod, K = myK, save_fits = TRUE)

# Custom stratified folds
ids <- loo::kfold_split_stratified(K = myK, x = data$Pres)
myfits_K <- brms::kfold(RefMod, folds = ids, K = myK, save_fits = TRUE)
cvfits = structure(myfits_K$fits[, "fit"],
                   "folds" = ids)

# Variable selection
mycvvs <- cv_varsel(
  RefMod,
  method = "forward",
  cv_method = "kfold",
  K = myK,
  cvfits = cvfits,
  seed = 86132035)
fweber144 commented 7 months ago

Though I am not sure whether custom CV folds can also be created with run_cvfun() or alternative functions.

This was implemented in https://github.com/stan-dev/projpred/pull/480 and hence will also be included in the upcoming CRAN release (if you need this feature until that CRAN release is completed, you can use the GitHub version): https://github.com/stan-dev/projpred/blob/1bce71e5f030f4934ba915e58da6db40b9c93db8/NEWS.md?plain=1#L33

Btw, save_fits = TRUE can be omitted from the run_cvfun() call (gets ignored anyway).