stan-dev / projpred

Projection predictive variable selection
https://mc-stan.org/projpred/
Other
110 stars 26 forks source link

Projection onto the full model for multilevel Gaussian models #323

Open fweber144 opened 2 years ago

fweber144 commented 2 years ago

For multilevel Gaussian models, the projection onto the full model could be instable, if not even incorrect:


# Setup -------------------------------------------------------------------

default_post_warmup <- 1000
ndraws_ref <- 30
thin_ref <- ceiling(default_post_warmup / ndraws_ref)
set.seed(856824715)

# For simulating a dataset:
dataconstructor <- function() {
  nobsv <- 750L
  icpt <- -0.2

  npreds_cont <- 5L
  coefs_cont <- seq(-4, 4, length.out = npreds_cont)

  ngrPL <- 3L
  coefs_grPL <- c(0, seq(-2, 2, length.out = ngrPL - 1L))

  # Number of observations per group:
  ngrGL <- 375L
  coefs_grGL <- list("icpt" = rnorm(ngrGL, sd = 1.5))

  # Continuous predictors:
  dat_sim <- setNames(replicate(npreds_cont, {
    rnorm(nobsv, sd = 0.5)
  }, simplify = FALSE), paste0("Xcont", seq_len(npreds_cont)))
  dat_sim <- as.data.frame(dat_sim)
  eta <- icpt + drop(as.matrix(dat_sim) %*% coefs_cont)

  # Population-level (PL) categorical predictor:
  dat_sim$XgrPL1 <- sample(
    gl(n = ngrPL, k = floor(nobsv / ngrPL), length = nobsv,
       labels = paste0("gr", seq_len(ngrPL)))
  )
  eta <- eta + coefs_grPL[dat_sim$XgrPL1]

  # Group-level (GL) categorical predictor:
  dat_sim$XgrGL1 <- sample(
    gl(n = ngrGL, k = floor(nobsv / ngrGL), length = nobsv,
       labels = paste0("gr", seq_len(ngrGL)))
  )
  eta <- eta + coefs_grGL$icpt[dat_sim$XgrGL1]

  # Response:
  dat_sim$Y <- rnorm(nobsv, mean = eta, sd = 1.6)

  # Formula:
  voutc <- "Y"
  vpreds <- grep("^Xcont|^XgrPL", names(dat_sim), value = TRUE)
  vpreds_GL <- grep("^XgrGL", names(dat_sim), value = TRUE)
  vpreds_GL <- paste0("(1 | ", vpreds_GL, ")")
  fml_sim <- as.formula(paste(
    voutc, "~", paste(c(vpreds, vpreds_GL), collapse = " + ")
  ))

  # Output:
  return(list(true_GLEs = coefs_grGL$icpt,
              dat = dat_sim,
              fml = fml_sim))
}

# For fitting the reference model and projecting it onto itself, i.e., onto the
# full model:
ref_prj <- function(dat, fml, ...) {
  refm_fit <- rstanarm::stan_glmer(
    formula = fml,
    family = gaussian(),
    data = dat,
    chains = 1,
    thin = thin_ref,
    QR = TRUE,
    refresh = 0
  )
  soltrms <- labels(terms(fml))
  soltrms[grep("\\|", soltrms)] <- paste0("(", soltrms[grep("\\|", soltrms)], ")")
  ### For projecting onto a submodel, not the full model:
  # soltrms <- setdiff(soltrms, grep("^Xcont", soltrms, value = TRUE))
  ###
  prj <- projpred::project(refm_fit, solution_terms = soltrms, ...)
  sd_colnm <- "Sigma[XgrGL1:(Intercept),(Intercept)]"
  return(list(
    sd_ref = as.matrix(refm_fit)[, sd_colnm],
    sd_prj = as.matrix(prj)[, sd_colnm]
  ))
}

# Run ---------------------------------------------------------------------

sim_dat_etc <- dataconstructor()
# Debug so that the lme4::lmer() call inside of projpred:::fit_glmer_callback()
# can be run without suppressed warnings and messages:
debug(projpred:::fit_glmer_callback)
simres <- ref_prj(dat = sim_dat_etc$dat,
                  fml = sim_dat_etc$fml,
                  ndraws = ndraws_ref)

The instability can be seen from the warnings

Warning messages:
1: In optwrap(optimizer, devfun, getStart(start, rho$pp), lower = rho$lower,  :
  convergence code -4 from nloptwrap: NLOPT_ROUNDOFF_LIMITED: Roundoff errors led to a breakdown of the optimization algorithm. In this case, the returned minimum may still be useful. (e.g. this error occurs in NEWUOA if one tries to achieve a tolerance too close to machine precision.)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  unable to evaluate scaled gradient
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge: degenerate  Hessian with 1 negative eigenvalues

shown when debugging as outlined in the reprex above and from comparing simres$sd_prj to simres$sd_ref.

This instability does not seem to occur when projecting onto an actual submodel (e.g., by uncommenting soltrms <- setdiff(soltrms, grep("^Xcont", soltrms, value = TRUE)) in the reprex above).

The reason for the instability could be that in the projection onto the full model (i.e., in the lme4::lmer() fit), the residual SD is zero. So far, I did not encounter a similar issue for non-multilevel Gaussian models fitted via projpred:::fit_glm_ridge_callback() or via projpred:::fit_glm_callback(). So this could indeed be restricted to multilevel models.

This issue shows that—at least as a first step—the convergence of the algorithms used for fitting the submodels needs to be checked. For this, see the code started in PR #259. As a longer-term objective, I think we need to investigate this issue via simulation to find out if it's really an issue and—if yes—fix it.

fweber144 commented 1 year ago

Minor update: When re-running the reprex above, I now get the following warnings:

Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.225752 (tol = 0.002, component 1)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?

Nevertheless, I think this still illustrates the issue.

(Argument nAGQ was removed as it is only relevant for lme4::glmer().)

fweber144 commented 1 year ago

Further update: As mentioned at https://discourse.mc-stan.org/t/cv-varsel-error-infinite-or-missing-values-in-x/31703/8, the projection of a Gaussian multilevel reference model onto the full model can even lead to an error.