avehtari opened this pull request 5 months ago
I wanted to use R2, and as I had rewritten the summary stats anyway, I added R2 and made all of mse, rmse, and R2 use only a normal approximation, with as much shared computation as possible.

With the added R2 support, this PR will also close #483.
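The kind of sharing meant here could look roughly like this (a hedged sketch with a hypothetical helper `summ_stats()`, not the PR's actual implementation): compute the per-observation squared errors once, then derive MSE, RMSE, and R2 from them, with a normal-approximation SE shown for the MSE.

```r
# Hedged sketch (hypothetical helper, not the PR's actual code): the squared
# errors e2 are the shared quantity from which mse, rmse, and R2 all follow.
summ_stats <- function(y, mu) {
  e2 <- (y - mu)^2                     # shared per-observation squared errors
  n <- length(y)
  mse <- mean(e2)
  list(mse    = mse,
       se_mse = sd(e2) / sqrt(n),      # normal approximation for SE(mse)
       rmse   = sqrt(mse),             # derived from the shared mse
       R2     = 1 - mse / mean((y - mean(y))^2))
}

set.seed(42)
y <- rnorm(100)
mu <- y + rnorm(100, sd = 0.5)
st <- summ_stats(y, mu)
```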
I can take a look.

I have added some comments, but I'm not done with the review yet.
Besides, I think the documentation needs to be updated (at least re-roxygenized, but progressr should also be mentioned), perhaps the vignettes as well, and I haven't run `R CMD check` (including the unit tests) yet. The `NEWS` file would also need to be updated.
Thanks for all the comments.
I have run the unit tests, but not yet `R CMD check`. I had not updated all the docs, as I wanted to get approval on the behavior changes first. I'll start working on the doc changes next.
Merged master, which included the new Pareto-k diagnostics. There were several conflicts in `cv_varsel.R`, and I hope I resolved them correctly.

Re-running tests after the merge.

This is a new one:
```
Error in `FUN()`:
! In path: "./projpred/tests/testthat/setup.R"
Caused by error:
! (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
```
I get the same PIRLS error with the master branch, so it's not just this PR.
That's strange. For me, the tests succeed on `master`. Can you locate where this error occurs in `setup.R` (and for which model)?
`traceback()` says line 1423 in `setup.R`. I tried to run it line by line, but that's difficult because some variables controlling what is executed keep changing to `FALSE` (e.g. `run_vs` and `run_cvvs`), so I'm not able to reproduce what is executed when calling `devtools::test()`. I've now run out of time for today.
I have now reviewed the files `glmfun.R`, `methods.R`, `misc.R`, `summary_funs.R`, and `varsel.R`. The file `cv_varsel.R` (as well as `DESCRIPTION` and all of the test files) is still missing. Before proceeding with `cv_varsel.R`, I think I should read Magnusson et al. (2020) again because I read it long ago and have forgotten many details. Also, as mentioned above (just to keep track of this), some other things are still missing (documentation, vignettes, `R CMD check` (including unit tests), the `NEWS` file).
Sorry, I forgot to mention that in `summary_funs.R`, I have not reviewed several parts yet:

1. The `else if (stat %in% c("mse", "rmse", "R2")) {` branch, in particular the code following the comment

   ```r
   # Compute mean and variance in log scale by matching the variance of a
   # log-normal approximation
   # https://en.wikipedia.org/wiki/Log-normal_distribution#Arithmetic_moments
   ```

2. The definition of `.srs_diff_est_w()` as well as all of its occurrences.

Reviewing these parts probably also makes sense only after having read Magnusson et al. (2020). Reviewing R2 and the changes for MSE and RMSE also requires some more time, I guess.
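For reference, the Wikipedia formulas behind that comment can be sketched as follows (a standalone illustration with a hypothetical function name, not the code from `summary_funs.R`): given the arithmetic mean `m` and variance `v` of a positive statistic, the matching log-normal parameters follow in closed form.

```r
# Standalone illustration (hypothetical function, not projpred's code):
# match a log-normal distribution to arithmetic mean m and variance v,
# i.e. invert E[X] = exp(mu + sigma2/2) and
# Var[X] = (exp(sigma2) - 1) * exp(2*mu + sigma2).
lognormal_match <- function(m, v) {
  sigma2 <- log(1 + v / m^2)  # variance on the log scale
  mu <- log(m) - sigma2 / 2   # mean on the log scale
  list(mu = mu, sigma2 = sigma2)
}

# Round trip: the matched parameters reproduce the arithmetic moments.
p <- lognormal_match(m = 2, v = 0.5)
m_back <- exp(p$mu + p$sigma2 / 2)
v_back <- (exp(p$sigma2) - 1) * exp(2 * p$mu + p$sigma2)
```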
Note that there is one typo in the paper as mentioned in https://github.com/stan-dev/loo/pull/238
I trust all the new delta-method and variance-matching parts, as the results match well with the bootstrap results.
Thanks. I'll take a look at that, too.
Concerning the tests: I just pushed commit 28e011789ea2b5ad2743c7821bc88b2ca2c76a77 to `master` because after upgrading CmdStan to v2.35.0, I received an error at a place where a tolerance was set too tightly. Apart from that, the tests on branch `master` ran through on my machine. But since, in general, Stan results depend on the machine, it could be that you are encountering an error while I don't. Nevertheless, this shows how fragile the tests currently are (especially for GLMM submodels, where I think you are encountering the error; GAMs and GAMMs are probably similarly fragile). In the future, we could try to make the tests more stable. I think an important reason for the current instability is the small number of observations, the small K in K-fold CV, the small number of posterior draws, and the small number of projected draws / clusters. All of these have been chosen so small because of the high computational cost associated with higher values.
I just pushed several commits with minor tweaks concerning progressr (commits 71a2198f791fe837853a855dc07a380d4e1ceac2 to 0996e7c476ba68d940a834314f3cbf2beebe097d). The new global option `projpred.use_progressr` still needs to be documented. And if possible, we should also add tests for progressr.
I also pushed two other commits: 8488e3eeddfd376b8ba623bc761c8bed46229857 and 118838cf1b30836422ccdf3ef1f1fdb5fa6c7e68. These are the result of my attempts to reproduce the issue you reported in https://github.com/stan-dev/projpred/pull/496#discussion_r1579209073. I finally managed to reproduce it using:
```r
data("df_gaussian", package = "projpred")
df_gaussian <- df_gaussian[1:29, ]
dat <- data.frame(y = df_gaussian$y, df_gaussian$x)
rfit <- rstanarm::stan_glm(y ~ X1 + X2 + X3 + X4 + X5,
                           data = dat,
                           chains = 1,
                           iter = 500,
                           seed = 1140350788,
                           refresh = 0)
devtools::load_all(".")
# library(projpred)
library(progressr)
handlers(global = TRUE)
doFuture::registerDoFuture()
export_default <- options(doFuture.foreach.export = ".export")
ncores <- 4L
future::plan(future::multisession, workers = ncores)
stopifnot(identical(foreach::getDoParWorkers(), ncores))
cvvs_K_par <- cv_varsel(rfit, cv_method = "kfold", K = 4, seed = 5602,
                        parallel = TRUE)
```
Most importantly, the `.select <- .select` lines are not needed when the package is installed and then attached via `library(projpred)` (I guess loading should also suffice). So I guess the issue is caused by `devtools::load_all()`. Since users will never use `devtools::load_all()`, I have removed those lines now.

The `do_call()` issue mentioned in commit 8488e3eeddfd376b8ba623bc761c8bed46229857 did not go away when installing projpred instead of using `devtools::load_all()`. But I found the `.package` solution, which I think is the best solution we can come up with.
I forgot to mention that I removed the `.select <- .select` lines after having introduced the `.package` solution for the `do_call()` issue, so perhaps the `.package` solution also played a part in resolving the `.select <- .select` issue.
I'm still unable to run the tests due to the error `(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate`.
The PIRLS error comes from `setup.R` line 1424, which eventually calls `gamm4()` with

```r
projpred_formula_no_random = .y_gamm_brnll.1 ~ xco.1 + xca.1 + s(s.1)
projpred_random = ~(xco.1 | z.1)
```

I also have the specific data if that helps. I have not been able to figure out what goes wrong.
The tests run if I change `nobsv <- 41L` in `setup.R` to `nobsv <- 43L`, but then of course all the results change. Still, this shows it's a problem with the small `nobsv` and random data.
It seems I have not understood the logic of deltas and diffs in `plot.vsel()` and `summary.vsel()`: when I tried to fix `summary.vsel()`, `plot.vsel()` got messed up.
After your changes, progressr doesn't work for me. `use_progressr()` returns `TRUE` and I have `verbose = TRUE`, but I don't get any progress bar.
I removed the `handlers()` calls that you introduced into projpred in commit 71a2198f791fe837853a855dc07a380d4e1ceac2 because of the guidelines given in https://progressr.futureverse.org/articles/progressr-intro.html#two-minimal-apis---one-for-developers-and-one-for-end-users. According to those guidelines, the user should call `progressr::handlers()` him- or herself, e.g., via `handlers(global = TRUE)`. Do you have such a `handlers()` call? If yes (and it still does not work), could you provide a reprex?
> The tests run if I change `nobsv <- 41L` in `setup.R` to `nobsv <- 43L`, but then of course all the results change. But this shows it's a problem with the small `nobsv` and random data.
Yes, we definitely need to make the tests more robust, as explained in https://github.com/stan-dev/projpred/pull/496#issuecomment-2198682118. Perhaps we can talk about that when also discussing the `deltas` and `diff` thing.
Re: progressr, thanks, it works now. I read the doc yesterday and still missed that I need to call `handlers()` (and I have no memory of how I implemented it the first time).
Concerning https://github.com/stan-dev/projpred/pull/496#issuecomment-2185043711, I am now done with reviewing the missing files (`cv_varsel.R`, `DESCRIPTION`, and all of the test files, although I have not run the latter yet). Concerning https://github.com/stan-dev/projpred/pull/496#issuecomment-2185065683, I think I could finish reviewing this PR more quickly if we could move the R2 changes out into a separate PR. The same might hold for moving the "bootstrap → delta method" change (for RMSE) out into a separate PR. Another reason for moving these parts into separate PRs is that doing so makes it easier to compare the snapshots from the unit tests to their former variants for this PR.
It's not just moving the changes to a separate PR; it would also require implementing subsampling-LOO again for the old approaches. I would prefer not to implement subsampling-LOO for the old approaches just to be able to make two PRs. As adding subsampling-LOO added complexity, I wanted to simplify the code. All the tests passed after my changes, so there were no differences (within tolerance) in the results for non-subsampling-LOO summaries. I think separating these would practically mean implementing subsampling-LOO again from scratch for the code in the master branch and, after this PR had been merged, replacing that code with the code now in this PR. But if this PR will not progress otherwise, then I'll do it.
Sorry, I did not want to sound as if I would not finish reviewing this PR; I was only concerned about the time it takes.
You're right that we would have to choose an alternative approach for the RMSE, but as a temporary solution, I would have disallowed the RMSE for subsampled LOO-CV (or used the full `validate_search = FALSE` results, as for the AUC). But if you prefer to keep the delta method in this PR, I'm fine with it.
> Sorry, I did not want to sound as if I would not finish reviewing this PR, I was only concerned about the time it takes.
I'm also concerned about your time, as you are doing this in your own time and I don't want to burden you. Just say what I should do to save you enough time that you would be willing to finish helping to get this merged.

I now see that I probably should have first made cleanup PRs and only then the subsampling-LOO fix. Instead, I started correcting the faulty subsampling and made other improvements at the same time, which made the PR contain more changes.
> Just say what I should do to save you enough time that you would be willing to finish helping to get this merged.
Thanks, I'll let you know if I get stuck.
A question related to your comment here: Is it mentioned in Magnusson et al. (2020) that the variances from eq. (8) and eq. (9) need to be summed? I couldn't find such a statement at first glance, although it would make sense. By the way, in section 2.3 of the paper, point 5 says

> Estimate $\mathrm{elpd}_D$ and $V(\mathrm{elpd}_D)$ using Eq. (7) and (8).

which ignores eq. (9), but this might be a typo.
Commit a7458b98f8efbf255966c29769ac6fe08ce9253a reverts the changes with respect to the new "mixed deltas" variant of `plot.vsel()`, the new column behavior of `summary.vsel()`, and the omission of option `"best"` of argument `baseline` (the latter is now re-introduced, but disallowed for subsampled LOO-CV).

Now I still need to do points 1 and 3 from https://github.com/stan-dev/projpred/pull/496#issuecomment-2185065683 and, as mentioned in https://github.com/stan-dev/projpred/pull/496#issuecomment-2185043711, some other things are still missing (documentation, vignettes, `R CMD check` (including unit tests), the `NEWS` file), apart from the open discussions here.
Branch `mixed_deltas_plot` holds the state before reverting those things. I'll try to merge any future commits that are unrelated to those features into `mixed_deltas_plot`.
> A question related to your comment here: Is it mentioned in Magnusson et al. (2020) that the variances from eq. (8) and eq. (9) need to be summed? I couldn't find such a statement at first glance, although it would make sense.
It is not mentioned there, but it is required to get the appropriate variance for the plots. I first followed the paper, but then the variance can be really small; thinking it through, I realized that, at least in our use case, we need to add them together.
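In symbols, the combination described in this comment is simply (my restatement, not notation from the paper, assuming eq. (8) and eq. (9) estimate the two error sources discussed here):

```math
\widehat{V}_{\text{total}} = \widehat{V}_{\text{eq.\,(8)}} + \widehat{V}_{\text{eq.\,(9)}}
```

i.e., the variance of the elpd estimate itself and the variance induced by subsampling are treated as additive error sources for the plotted uncertainty.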
By the way, I had kept `loo_inds` for future use. Eventually it would be useful to be able to run subsampling-LOO first with e.g. `nloo = 50`, and then add another 50 observations if the variances are too big.
Makes sense. Did I break something with respect to `loo_inds`? If yes, that was not intentional.
It's not used, but I thought it's better to keep it there until incremental subsampling is supported, and then decide whether to keep it, depending on how incremental subsampling is implemented.
> It's not used

Yes, it wasn't, but in commit 19948e33b3a2c4ce6c90afc4552b8004078076e6, I changed that (note that there is also a fixup commit: 519dac268d130a593713f75ed6f5370b25392be0). I think using `loo_inds` in `get_stat()` is safer than relying on `NA`s (remember how hard a time we had finding out under which circumstances `NA`s can occur).
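A toy illustration of that point (hypothetical objects, not `get_stat()` itself): with subsampled LOO, selecting the evaluated observations via an explicit index vector like `loo_inds` stays unambiguous even if `NA`s could also appear for other reasons.

```r
# Toy illustration (hypothetical, not projpred's get_stat()): per-observation
# summaries where only a subsample of observations was evaluated.
lppd <- c(-1.2, NA, -0.8, NA, -1.5)  # NA = not evaluated under subsampling
loo_inds <- c(1L, 3L, 5L)            # indices that were actually evaluated

# Explicit indices: unambiguous about which observations to use.
mean_via_inds <- mean(lppd[loo_inds])

# NA-based filtering: gives the same number here, but silently changes
# meaning if an NA can also arise from a genuine failure elsewhere.
mean_via_na <- mean(lppd, na.rm = TRUE)
```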
When I changed several bootstraps to analytic approximations and improved other approximations, I thought the math I was writing in the code was so trivial that I didn't write down all the derivations and assumptions separately. Now I see I should have done that, as it takes even me a bit of time to re-check any of these when you ask a question, so they are not as obvious as I thought. If you like, I can some day write up the equations and assumptions for easier checking. Until then: based on the tests, every approximation I wrote is at least as accurate as the earlier bootstrap, but much faster.
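As one concrete example of the kind of derivation meant here (my own sketch of a delta-method step for the RMSE; the exact form in the code may differ): for g(x) = sqrt(x), the delta method gives Var(sqrt(MSE)) ≈ Var(MSE) / (4 · MSE).

```r
# Hedged sketch (assumed form, not necessarily the code's exact version):
# delta-method standard error for the RMSE from the MSE estimate and its
# variance, using g(x) = sqrt(x), g'(x) = 1 / (2 * sqrt(x)).
rmse_delta <- function(mse_hat, var_mse) {
  list(rmse = sqrt(mse_hat),
       se_rmse = sqrt(var_mse / (4 * mse_hat)))
}

# Example with simulated squared errors:
set.seed(1)
err2 <- rchisq(1e5, df = 1)
mse_hat <- mean(err2)
var_mse <- var(err2) / length(err2)
res <- rmse_delta(mse_hat, var_mse)
```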
> The tests run if I change `nobsv <- 41L` in `setup.R` to `nobsv <- 43L`, but then of course all the results change. But this shows it's a problem with the small `nobsv` and random data.
For me, `nobsv <- 43L` does not work (it runs into some error, similarly to how `nobsv <- 41L` did for you). However, `nobsv <- 39L` works for me. Does it work for you as well (on `master`)? Then I would pick that for the time being. Ultimately, it would be desirable to completely revise the tests because currently, we mainly test the "big" user-level functions, with the tests for all the "small" internal functions being quite short or not existing at all. The principle of testing should rather be to test the underlying functions extensively, because then it is easier to keep the tests for the "big" (and hence slow) user-level functions short.
With `nobsv <- 39L` I get `[ FAIL 0 | WARN 902 | SKIP 2 | PASS 60545 ]`.
That sounds good. The warnings probably arise from the first creation of the snapshots. If you are running the tests via `R CMD check`, then from the second run on, you can avoid these warnings by (locally) removing the entry `^tests/testthat/bfits$` from the `.Rbuildignore` file. For the two skips, I don't know where they are coming from, but they are probably due to a suggested package that is not installed.

Since this solution seems to be working for you, I will push a commit to `master` (and merge it here) that changes `nobsv` to `39`. As mentioned above, this is only a quick workaround.
While working on the tests, I discovered a few bugs and fixed them. However (apart from what is still open in the comments above), we still need to think about the following:

- In `cv_varsel.vsel()`, is it correct to treat `summaries_fast` like the search-related arguments? Or do we need to set `summaries_fast` to `NULL` if not usable, just like `rk_foldwise`?
- In general (at least without a `family$latent_ilink()` function), we can expect no `NA`s in reference and standard submodel summaries (`mu` and `lppd`). In case of subsampled LOO, I think we can expect no `NA`s in reference and fast submodel summaries. So in such cases, should (unexpected) `NA`s in the corresponding summaries be left as-is so that downstream code "sees" them? What I have in mind is, for example, whether we can remove the `na.rm = TRUE` part from https://github.com/stan-dev/projpred/blob/7469aeace5f9795d4c2f6ff608913e807c70a8f0/R/summary_funs.R#L305-L306 so that the default of `na.rm = FALSE` is used.
- For the tests, I think it would make sense to add subsampled-LOO `cv_varsel()` calls already in `setup.R` so that all post-processing functions are tested for such subsampled-LOO `cv_varsel()` output. When doing so, we should pay attention to cover augmented-data and latent projection cases.
- Finally, while experimenting with the tests, I discovered some strange `NA`s in the `summaries_sub` (of `.tabulate_stats()`) after running https://github.com/stan-dev/projpred/blob/7469aeace5f9795d4c2f6ff608913e807c70a8f0/tests/testthat/test_varsel.R#L1452-L1456 (slightly adapted, namely using `excl_nonargs(args_cvvs_i, nms_excl_add = "validate_search")`) on `tstsetup <- "rstanarm.glm.cumul.stdformul.without_wobs.with_offs.latent.default_meth.default_cvmeth.default_search_trms"`. This needs to be investigated.
Work in progress.

Tested with

Notes:

- `cv_varsel`: if `nloo < n` and the fast PSIS-LOO result is not yet available, the fast PSIS-LOO result is computed
- `cv_varsel`: if `nloo < n`, the fast PSIS-LOO result is stored in slot `$summaries_fast`

Next: tagging @n-kall
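The two notes about `cv_varsel` can be summarized in a schematic control flow (a deliberately simplified sketch, not the actual implementation):

```r
# Simplified schematic of the nloo < n behavior described in the notes
# (hypothetical structure, not projpred's actual cv_varsel() code):
run_subsampled_loo <- function(object, nloo, n) {
  if (nloo < n) {
    if (is.null(object$summaries_fast)) {
      # fast PSIS-LOO result not yet available: compute it (placeholder here)
      object$summaries_fast <- list(computed = TRUE)
    }
    # the fast PSIS-LOO result now lives in slot $summaries_fast for reuse
  }
  object
}

obj <- run_subsampled_loo(list(summaries_fast = NULL), nloo = 50, n = 100)
```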