stan-dev / projpred

Projection predictive variable selection
https://mc-stan.org/projpred/
Other
110 stars 26 forks source link

Using projpred with brm_multiple and mice #275

Open HenrikEckermann opened 2 years ago

HenrikEckermann commented 2 years ago

Hi,

I tried to use projpred after multiple imputation using the mice package. It did not produce results as expected (see here for plots of the output. I suspected that this was due to a combination of lack of predictive power and because of multiple imputation outside of the statistical model with mice may not be supported. @avehtari confirmed to me that multiple imputation might not yet be supported and to open an issue here.

In case it is in fact supported already: Below code reproduces the output. I fit random noise with and without using multiple imputation. I would expect that RMSE is ~equal between models thanks to the horseshoe prior. This is the case without imputations but when using multiple imputations RMSE increases when more variables are included.

One more thing: Note that models have divergent transitions. In my real dataset I circumvented that by using different priors that are less regularizing (normal(0, 0.2) for all betas). I am aware that the horseshoe prior is important to prevent overfitting but I did not find a way yet to fit the models without divergent transitions when using it in my dataset (any hints are welcome if there are better ways to circumvent that).


  library(tidyverse)
  library(brms)
  library(projpred)
  library(mice)

  # simulate data where all 25 beta are 0
  set.seed(1)
  n <- 75
  n_feat <- 25
  X <- map_dfc(1:n_feat, function(x) {
    df <- tibble(rnorm(n, 0, 2))
    colnames(df) <- glue::glue("var_{x}")
    df
  })
  y <- rnorm(n, 0, 1)
  df <- bind_cols(tibble(y = y), X)

  # projpred 
  n <- nrow(df) # 75
  D <- ncol(df[, -1]) # 25
  p0 <- 3 # prior guess for the number of relevant variables
  tau0 <- p0/(D-p0) * 1/sqrt(n) 
  m1 <- brm(
    y ~ ., 
    family = gaussian(), 
    data = df,
    prior = prior(horseshoe(scale_global = tau0, scale_slab = 1), class = b),
    seed = 1, 
    chains = 2, 
    iter = 2000,
    # control = list(adapt_delta = 0.8)
  )
  refmodel <- get_refmodel(m1)
  vs <- cv_varsel(refmodel, method = "forward", cv_method = "LOO")
  plot(vs, stats = c("elpd", "rmse"))

  # create ~ 30% random missingness per var
  for (row in seq_along(df$var_1)) {
    for (col in seq_along(colnames(df))) {
      # ~ 30% missingness
      if (runif(0, 1, n = 1) > 0.7) {
        df[row, col] <- NA
      }
    }
  }

  # 10 imputations (predictive mean matching)
  dfimp <- mice(df, m = 10)

  # use multple imputed datasets with projpred 
  m2 <- brm_multiple(
    y ~ ., 
    family = gaussian(), 
    data = dfimp,
    prior = prior(horseshoe(scale_global = tau0, scale_slab = 1), class = b),
    seed = 1, 
    chains = 2, 
    iter = 2000,
    # control = list(adapt_delta = 0.8)
  )

  refmodel2 <- get_refmodel(m2)
  vs2 <- cv_varsel(refmodel2, method = "forward", cv_method = "LOO")
  plot(vs2, stats = c("elpd", "rmse"))

}
avehtari commented 2 years ago

Thanks for creating the issue

fweber144 commented 2 years ago

Yes, thanks for the issue. Output from brm_multiple() is indeed not supported by projpred and brms, at least when using the output from brm_multiple() directly as input to get_refmodel(). So this issue is about raising an appropriate message/warning/error. When running line refmodel2 <- get_refmodel(m2) from your example, I already get

Warning message:
Using only the first imputed data set. Please interpret the results with caution until a more principled approach has been implemented.

For now, I would consider this as sufficient. I don't know if you got that warning, too. Might depend on the brms version you are using. I am using brms's current GitHub version (commit 8ba9e53). A Git Blame for the relevant lines in brms (https://github.com/paul-buerkner/brms/blob/8ba9e5333729e9d27e72a6fc81d746b4bb0d6251/R/brm_multiple.R#L231-L239 and https://github.com/paul-buerkner/brms/blob/8ba9e5333729e9d27e72a6fc81d746b4bb0d6251/R/prepare_predictions.R#L22) shows that brms should have thrown such a warning for about 3 years, but there might have been other lines I am currently not aware of which inhibited this or caused prepare_predictions.brmsfit() not to be called in get_refmodel.brmsfit(). Anyway, the important point is that the upcoming CRAN version of brms should throw this warning.

Thanks again and I'll think about documenting the missing support for brm_multiple().

fweber144 commented 1 year ago

Sorry, I should have left this open as a feature request! Re-opening now and adding the correct label.

fweber144 commented 1 year ago

Related: https://discourse.mc-stan.org/t/projpred-model-with-horseshoe-prior-does-not-converge/29935