stan-dev / projpred

Projection predictive variable selection
https://mc-stan.org/projpred/

Subsampled LOO #94

Open fweber144 opened 3 years ago

fweber144 commented 3 years ago

In cv_varsel.refmodel(), the observation-specific weights are hard-coded to be all equal (line https://github.com/stan-dev/projpred/blob/0cabd73548eec32e9b675f15cd39d377a40f180a/R/cv_varsel.R#L221 ), but in the case of subsampled LOO-CV, they might differ (as the comment there also says). Thus, shouldn't that line be modified?

AlejandroCatalina commented 3 years ago

Yes. This is a bit less straightforward to fix because we can compute either approximate LOO or full LOO, so I think the right weights should be returned by loo_varsel and kfold_varsel themselves. I'll work on this after resolving the easier issues.

fweber144 commented 2 years ago

Related comments/discussions may be found in issue #173 (enumeration point 4) and PR #188. Now that I understand the purpose of w better, I think constant weights for LOO-CV are ok (no matter whether subsampled or not) but they might be inappropriate for K-fold CV with folds of very different sizes.

fweber144 commented 2 years ago

I have now read Magnusson et al. (2019) and Magnusson et al. (2020), but I'm still confused about how projpred is supposed to perform subsampled LOO CV. More importantly, I have the strong feeling that the current implementation is not correct. What I learned from Magnusson et al. (2019) and Magnusson et al. (2020) makes me think the problems with the implementation in projpred go deeper than I initially thought.

What is important for projpred is that:

  1. Magnusson et al. (2019) propose to perform probability-proportional-to-size sampling (PPS) combined with the Hansen-Hurwitz (HH) estimator.
  2. Magnusson et al. (2020) propose to perform simple random sampling (SRS) combined with the difference estimator. The reason for this new proposal is that the approach by Magnusson et al. (2019) is not optimal for model comparisons (as explained by Magnusson et al., 2020). (A toy sketch of both estimators follows this list.)
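
Here is a minimal toy sketch (my own illustration, not projpred code) of the two estimators, using simulated pointwise ELPD contributions:

set.seed(1)
N <- 100; m <- 20
# Cheap approximation of the pointwise LOO contributions (the surrogate) and
# the "exact" contributions it approximates:
elpd_approx_i <- rnorm(N, mean = -1.5, sd = 0.5)
elpd_i <- elpd_approx_i + rnorm(N, sd = 0.05)

# (1) PPS + Hansen-Hurwitz (Magnusson et al., 2019): sample WITH replacement
# with probabilities proportional to the magnitude of the approximate log
# predictive densities; each sampled term is weighted by 1 / p_i:
p_i <- abs(elpd_approx_i) / sum(abs(elpd_approx_i))
idx_pps <- sample(N, m, replace = TRUE, prob = p_i)
elpd_hh <- mean(elpd_i[idx_pps] / p_i[idx_pps])

# (2) SRS + difference estimator (Magnusson et al., 2020): sample WITHOUT
# replacement and correct the summed approximation by the scaled-up error
# observed in the subsample:
idx_srs <- sample(N, m)
elpd_diffe <- sum(elpd_approx_i) +
  N / m * sum(elpd_i[idx_srs] - elpd_approx_i[idx_srs])

c(true = sum(elpd_i), hh = elpd_hh, diffe = elpd_diffe)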

However, in my opinion, neither the approach by Magnusson et al. (2019) nor that by Magnusson et al. (2020) is currently used in projpred. The key equation from Magnusson et al. (2019) is equation (4), and the key equation from Magnusson et al. (2020) is equation (7). However, I can't find either of these two equations in projpred. The appropriate place would probably be lines https://github.com/stan-dev/projpred/blob/dd4bc36aba24ea680050f680d4c1cf643a3680fc/R/summary_funs.R#L171-L179 (these lines only refer to the ELPD as performance statistic; projpred has more performance statistics, of course), but this code corresponds neither to equation (4) from Magnusson et al. (2019) nor to equation (7) from Magnusson et al. (2020). The SE estimators from these lines also don't correspond to the SE estimators from Magnusson et al. (2019, equation (6)) and Magnusson et al. (2020, equation (9)).

The subsampling itself (performed in lines https://github.com/stan-dev/projpred/blob/dd4bc36aba24ea680050f680d4c1cf643a3680fc/R/cv_varsel.R#L839-L840 ) also confuses me. For example, for the approach by Magnusson et al. (2019), I would have expected the logarithms of the predictive densities (not the predictive densities themselves) in the definition of w, and then sampling with replacement.

So if the current implementation of subsampled LOO CV in projpred is not correct, then I think projpred should not support subsampled LOO CV until this is fixed.

Of course, I could be totally wrong, and neither Magnusson et al. (2019) nor Magnusson et al. (2020) is relevant for subsampled LOO CV in projpred. But then I would be even more confused, because this comment: https://github.com/stan-dev/projpred/blob/dd4bc36aba24ea680050f680d4c1cf643a3680fc/R/cv_varsel.R#L821-L825 indicates that projpred wants to use the approach by Magnusson et al. (2019).

If my assumption is correct and projpred needs a fix, then a way to implement either the approach by Magnusson et al. (2019) or that by Magnusson et al. (2020) might be to rely on loo::loo_subsample() (see this vignette). This is just a rough idea, though, and I don't know if it is really feasible.
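
For illustration, here is a rough sketch of a loo::loo_subsample() call (following the loo vignette); the data frame 'dat_gauss' (with response y and predictors X1, ..., XD) and the posterior draws object 'post_draws' are hypothetical here, and whether this could actually be wired into projpred is exactly the open question:

library(loo)
# Log-likelihood function for a single observation (one-row data frame
# 'data_i'), evaluated over all S posterior draws; 'draws' is assumed to be a
# list with elements 'alpha' (length-S vector), 'beta' (S x D matrix), and
# 'sigma' (length-S vector) for a Gaussian model:
llfun_gauss <- function(data_i, draws) {
  x_i <- as.matrix(data_i[, grepl("^X", names(data_i)), drop = FALSE])
  mu_i <- as.vector(draws$alpha + draws$beta %*% t(x_i))
  dnorm(data_i$y, mean = mu_i, sd = draws$sigma, log = TRUE)
}
loo_ss <- loo_subsample(
  llfun_gauss,
  data = dat_gauss, draws = post_draws,
  observations = 25,          # subsample size
  estimator = "diff_srs",     # SRS combined with the difference estimator
  loo_approximation = "plpd", # cheap point-estimate-based approximation
  r_eff = NA
)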

Of course, given these new findings, my statement

[...] I think constant weights for LOO-CV are ok (no matter whether subsampled or not) [...]

from the comment above (see also enumeration point 4 in #173) is not correct. At that time, I had not yet realized that projpred (probably) wants to use the approach by Magnusson et al. (2019), which requires sampling with replacement. This in turn means that each observation can occur more than once in the subsample (in contrast to my statement

[...] after all, they all represent a single observation [...]

from #188), so that when aggregating across the subsampled LOO CV folds, non-constant weights do make sense (but note that these would have to be inverse-probability weights, in contrast to the currently used non-inverted probability weights).
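
For reference, the Hansen-Hurwitz estimator under PPS sampling with replacement weights each subsampled term by the inverse of its selection probability (written here in generic survey-sampling notation, not necessarily that of Magnusson et al., 2019):

$$
\widehat{\mathrm{elpd}}_{\mathrm{HH}} = \frac{1}{m} \sum_{j = 1}^{m} \frac{\pi_{i_j}}{p_{i_j}}, \qquad \sum_{i = 1}^{N} p_i = 1,
$$

where $\pi_i$ is observation $i$'s exact LOO log predictive density, $p_i$ its selection probability, and $i_1, \ldots, i_m$ are the indices drawn with replacement.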

References

Magnusson, M., Andersen, M., Jonasson, J., and Vehtari, A. (2019). Bayesian leave-one-out cross-validation for large data. In Proceedings of the 36th International Conference on Machine Learning, 4244-4253. URL: https://proceedings.mlr.press/v97/magnusson19a.html.

Magnusson, M., Vehtari, A., Jonasson, J., and Andersen, M. (2020). Leave-one-out cross-validation for Bayesian model comparison in large data. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 341-351. URL: https://proceedings.mlr.press/v108/magnusson20a.html.

fweber144 commented 2 years ago

An alternative to not supporting subsampled LOO CV at all anymore might be to throw a warning concerning a potentially incorrect implementation.

fweber144 commented 1 year ago

See also

I also noticed that <vsel_object>$summaries$ref$w (so w for the reference model) is always NULL, which is strange because lines https://github.com/stan-dev/projpred/blob/33047d6c9ea98d0644cb0c8503424f61ab0127af/R/summary_funs.R#L76-L78 seem to expect a non-NULL element w for the reference model as well (which would also make sense if w is supposed to contain the weights of the CV folds the observations are in, because the reference model's performance is also cross-validated whenever the submodels' performances are).

in this comment. When fixing subsampled LOO CV, I guess that also needs to be fixed.

fweber144 commented 1 year ago

More issues/questions concerning subsampled LOO CV:

  1. NAs in get_stat()'s baseline objects mu.bs and lppd.bs are not handled consistently across the different stats (I think the handling from RMSE and AUC also needs to be used for all other stats). Furthermore, I think n_notna needs to be updated according to the NAs in the baseline objects mu.bs and lppd.bs if baseline = "best". I have now (PR #350) also added n_notna.bs, which could perhaps be re-used.
  2. For the AUC, the modified mu doesn't even seem to be used (see lines https://github.com/stan-dev/projpred/blob/e7080db29e5d77c628b067a97a6da80b72a69156/R/summary_funs.R#L295-L307).
  3. It seems like loo_varsel()$d_test always contains all observations as test points, but line https://github.com/stan-dev/projpred/blob/e7080db29e5d77c628b067a97a6da80b72a69156/R/summary_funs.R#L103 doesn't reduce this down to the subsampled observations. Is this on purpose?

fweber144 commented 8 months ago

The comment from https://github.com/stan-dev/projpred/issues/94#issuecomment-1195207164 has been addressed by commit 2bd6b4efdd6a2f327d73c563da6f6fe451f4470d (within PR #475).

Question 1 from https://github.com/stan-dev/projpred/issues/94#issuecomment-1246694046 has now been addressed by commit 7232148561c13475d13392a7ae841ceb39d2b80c (within PR #475).

Question 2 from https://github.com/stan-dev/projpred/issues/94#issuecomment-1246694046 has now been addressed by commit d109c371de960ddf1e2567eb9e4c0fc5e68339d9 (within PR #475).

Question 3 from https://github.com/stan-dev/projpred/issues/94#issuecomment-1246694046 doesn't need to be addressed (i.e., it can be ignored) because that behavior was probably indeed on purpose (non-subsampled observations have NA as predicted value and so they will be ignored by get_stat()).

fweber144 commented 8 months ago

For the future: In the code, I have added comments starting with "TODO (subsampled PSIS-LOO CV)", which need to be addressed. However, addressing these comments does not guarantee a correctly behaving subsampled PSIS-LOO CV (in particular, there are probably deeper methodological issues that need to be addressed as well, see above).

fweber144 commented 8 months ago

And a question to @AlejandroCatalina: Is it on purpose that the bootstrap samples are drawn from the set of all observations (i.e., with the NA observations included)? Couldn't it also make sense to draw the bootstrap samples from only the subsampled observations (i.e., the non-NA observations)?

AlejandroCatalinaF commented 8 months ago

Yeah, it would be okay too, I think.
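
For illustration, a minimal sketch (toy data and hypothetical names, not projpred's actual internals) of what drawing the bootstrap samples from only the subsampled, i.e., non-NA, observations could look like:

# Toy stand-in for a vector of predicted values in which non-subsampled
# observations are NA (an assumption about get_stat()'s inputs):
mu <- c(1.2, NA, 0.7, NA, 2.1, 0.3)
obs_nonNA <- which(!is.na(mu))
# Bootstrap samples drawn from only the subsampled (non-NA) observations:
boot_inds <- replicate(2000, sample(obs_nonNA, length(obs_nonNA), replace = TRUE))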

fweber144 commented 6 months ago

@avehtari: Still to do (or to check) here:

avehtari commented 6 months ago

(Sorry for not checking this 2 years ago!)

avehtari commented 6 months ago

Here is a demonstration of subsampled LOO with simple random sampling and the difference estimator:

library(rstanarm)
library(projpred)
library(tidyr)
library(ggplot2)

data("df_gaussian")
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

# Prior guess for the number of relevant (i.e., non-zero) regression
# coefficients:
p0 <- 5
# Number of observations:
N <- nrow(dat_gauss)
# Number of predictors:
D <- sum(grepl("^X", names(dat_gauss)))
# Hyperprior scale for tau, the global shrinkage parameter (note that for the
# Gaussian family, 'rstanarm' will automatically scale this by the residual
# standard deviation):
tau0 <- p0 / (D - p0) * 1 / sqrt(N)

set.seed(5078022)
refm_fit <- stan_glm(
  y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 + X12 + X13 + X14 +
    X15 + X16 + X17 + X18 + X19 + X20,
  family = gaussian(),
  data = dat_gauss,
  prior = hs(global_scale = tau0),
  ### Only for the sake of speed (not recommended in general):
  chains = 2, iter = 1000,
  ###
  refresh = 0
)

refm_obj <- get_refmodel(refm_fit)

# Fast with validate_search = FALSE
cvvs_fast <- cv_varsel(
  refm_obj,
  validate_search = FALSE,
  nterms_max = 20
)

plot(cvvs_fast, alpha=0.1, deltas=TRUE,
     text_angle=45, size_position = 'primary_x_top') +
  geom_vline(xintercept=seq(0,20,by=5), colour='black', alpha=0.1)

# For comparison purposes: slow, with validate_search = TRUE and no
# subsampled LOO
cvvs <- cv_varsel(
  cvvs_fast,
  validate_search = TRUE
)

plot(cvvs, alpha=0.1, deltas=TRUE, show_cv_proportions=FALSE,
     text_angle=45, size_position = 'primary_x_top') +
  geom_vline(xintercept=seq(0,20,by=5), colour='black', alpha=0.1)

# Subsampled LOO with the difference estimator and simple random sampling
# starts with drawing the simple random sample. For demonstration, we use a
# subsample of nloo=25. nloo should not be too small, but in this case
# the full N is only 100, so we use quite a small nloo for demonstration
# purposes.
nloo <- 25
set.seed(68)
# Simple random sample (without replacement):
sub_inds <- sample(1:N, nloo)

# For subsampled LOO with the difference estimator, we need surrogate
# results that are close to the full LOO results. In the case of a search,
# we can expect the fast submodel LOO with validate_search=FALSE to be a
# good surrogate.
#
cvvs_fast_perf <- performances(cvvs_fast)
cvvs_perf <- performances(cvvs)
# For the intercept-only model (k = 0) and the full model (k = 20), the
# validate_search=TRUE results are the same as with validate_search=FALSE.
elpd_diffe <- numeric()
k <- 0
elpd_diffe[k+1] <- sum(cvvs_fast$summaries$sub[[k+1]]$lppd)
k <- 20
elpd_diffe[k+1] <- sum(cvvs_fast$summaries$sub[[k+1]]$lppd)
elpd_sub <- elpd_diffe
# For the rest we use the difference estimator
for (k in 1:(20-1)) {
  # The proxy is the corresponding validate_search=FALSE result
  elpd_loo_hat <- cvvs_fast$summaries$sub[[k+1]]$lppd
  # For demonstration purposes, instead of running the search with
  # subsampling, we subsample results from the run with
  # validate_search=TRUE and nloo=N
  elpd_loo_sub <- cvvs$summaries$sub[[k+1]]$lppd[sub_inds]
  # The difference estimator
  elpd_diffe[k+1] <- sum(elpd_loo_hat) + N/nloo * sum(elpd_loo_sub - elpd_loo_hat[sub_inds])
  # For comparison, compute the simple subsample average
  elpd_sub[k+1] <- N/nloo * sum(elpd_loo_sub)
}

# Compare 1) validate_search=FALSE elpd, 2) validate_search=TRUE with the
# subsampling difference estimator elpd, and 3) validate_search=TRUE with
# subsampling, using the simple subsample average
data.frame(k=0:20,
           fast=cvvs_fast_perf$submodels[,'elpd']-cvvs_perf$submodels[,'elpd'],
           diffe=elpd_diffe-cvvs_perf$submodels[,'elpd'],
           simple=elpd_sub-cvvs_perf$submodels[,'elpd'])|>
  pivot_longer(cols = !k,
               names_to = 'method',
               values_to = 'elpd') |>
  ggplot(aes(x=k, y=elpd, color=method)) +
  geom_line() +
  labs(y='difference to slow elpd')

[Plot: difference to the slow (full) elpd as a function of the number of terms k, for the three methods (diffe, fast, simple)]

avehtari commented 6 months ago

The loo package has the diff_srs_est() function, which also estimates the variance (needed for SEs), but the code has some mismatch with the corresponding paper and the estimated variance can be negative, so I'm waiting for https://github.com/stan-dev/loo/issues/237 to be solved before continuing here.
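
For reference, the textbook variance estimator for the difference estimator under simple random sampling without replacement (the paper's exact expression may include additional components) is

$$
\widehat{V}\big(\widehat{\mathrm{elpd}}_{\mathrm{diff}}\big) = N^2 \left(1 - \frac{m}{N}\right) \frac{s_d^2}{m}, \qquad s_d^2 = \frac{1}{m - 1} \sum_{j \in S} \big(d_j - \bar{d}\big)^2,
$$

where $d_j = \pi_j - \tilde{\pi}_j$ is the difference between the exact and surrogate LOO contributions for subsampled observation $j$.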

fweber144 commented 5 months ago

Thanks for the clarifications. Now it's clear to me what projpred is supposed to do in case of subsampled LOO.

fweber144 commented 5 months ago

Concerning

loo_subsample_pps returns the weights, but they are not used elsewhere in the code

I think these weights (called wcv in the code) are used in some places:

I just wanted to add this for the sake of completeness. As you said, simple random sampling (SRS, combined with the difference estimator) doesn't require weights.