stan-dev / projpred

Projection predictive variable selection
https://mc-stan.org/projpred/

L1 search and `I()` terms #404

Closed fweber144 closed 1 year ago

fweber144 commented 1 year ago

During an L1 search, `I()` terms may cause an error:

N <- 41L
K <- 5L
K_fac <- 4L
set.seed(457324)
dat <- data.frame(
  y = rnorm(N),
  xcat = gl(n = K, k = floor(N / K), length = N,
            labels = paste0("gr", seq_len(K))),
  xfac = sample(gl(n = K_fac, k = floor(N / K_fac), length = N,
                   labels = paste0("fgr", seq_len(K_fac)))),
  xlog = sample(rep_len(c(TRUE, FALSE), length.out = N))
)
levels(dat$xfac) <- c(levels(dat$xfac),
                      paste0("fgr", (K_fac + 1L):(K_fac + 2L)))
dat$xcat <- as.character(dat$xcat)

library(rstanarm)
rfit <- stan_glm(y ~ xcat + xfac + I(!xlog),
                 data = dat,
                 seed = 1140350788,
                 chains = 1, iter = 500,
                 refresh = 0)

library(projpred)
# debug(projpred:::search_L1)
# debug(projpred:::collapse_contrasts_solution_path)
cvvs <- cv_varsel(rfit,
                  ### The issue does not occur with forward search:
                  # method = "forward",
                  ###
                  nclusters = 1,
                  nclusters_pred = 1,
                  seed = 46782345)

giving

Error in str2lang(x) : <text>:1:20: unexpected numeric constant
1: . ~ xfac + I(!xlog)TRUE
                       ^

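The issue seems to be that `collapse_contrasts_solution_path()` does not escape all characters that are special in regular expressions (it escapes only `+`), so the parentheses in `I(!xlog)` are presumably interpreted as a regex group instead of being matched literally: https://github.com/stan-dev/projpred/blob/a6ee4f9d11d20679f9207e974905fb6f8cbf0515/R/formula.R#L757-L783

This might be related to #183, and perhaps also to #182.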
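To illustrate the suspected mechanism outside of projpred, here is a minimal base-R sketch. It only uses the two strings visible in the error message; the variable names (`term`, `coef_nm`, etc.) are made up for illustration and are not projpred internals, and this is not necessarily how a fix in `collapse_contrasts_solution_path()` would look.

```r
# Strings taken from the error message; the variable names are hypothetical.
term <- "I(!xlog)"          # term as written in the formula
coef_nm <- "I(!xlog)TRUE"   # coefficient name seen in the error message

# Used as a regular expression, the parentheses form a group, so the
# pattern matches the text "I!xlog" rather than the literal term.
regex_hit <- grepl(paste0("^", term), coef_nm)

# Literal matching avoids the need to escape anything.
prefix_hit <- startsWith(coef_nm, term)
fixed_hit <- grepl(term, coef_nm, fixed = TRUE)
level_part <- sub(term, "", coef_nm, fixed = TRUE)

# If a regular expression is needed (e.g., for anchoring), PCRE's \Q...\E
# quoting treats everything in between literally (assuming the term does
# not itself contain "\E").
quoted_hit <- grepl(paste0("^\\Q", term, "\\E"), coef_nm, perl = TRUE)

# The unmodified coefficient name is not valid R syntax inside a formula,
# which is why str2lang() fails on it.
parse_fails <- inherits(
  tryCatch(str2lang(paste(". ~ xfac +", coef_nm)), error = function(e) e),
  "error"
)

print(c(regex = regex_hit, prefix = prefix_hit, fixed = fixed_hit,
        quoted = quoted_hit, parse_fails = parse_fails))
print(level_part)
```

Literal matching (`fixed = TRUE` or `startsWith()`) sidesteps the question of which characters need escaping; `\Q...\E` is an alternative when the surrounding code has to remain regex-based.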