avehtari opened this pull request 5 months ago
I wanted to use R2, and as I had rewritten the summary stats anyway, I added R2 and made all of mse, rmse, and R2 use only a normal approximation, with as much shared computation as possible.

With the added R2 support, this PR will also close #483.
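The kind of sharing meant here could look roughly like this (a hedged sketch with a hypothetical helper `summ_stats()`, not the PR's actual implementation): compute the per-observation squared errors once, then derive MSE, RMSE, and R2 from them, with a normal-approximation SE shown for the MSE.

```r
# Hedged sketch (hypothetical helper, not the PR's actual code): the squared
# errors e2 are the shared quantity from which mse, rmse, and R2 all follow.
summ_stats <- function(y, mu) {
  e2 <- (y - mu)^2                     # shared per-observation squared errors
  n <- length(y)
  mse <- mean(e2)
  list(mse    = mse,
       se_mse = sd(e2) / sqrt(n),      # normal approximation for SE(mse)
       rmse   = sqrt(mse),             # derived from the shared mse
       R2     = 1 - mse / mean((y - mean(y))^2))
}

set.seed(42)
y <- rnorm(100)
mu <- y + rnorm(100, sd = 0.5)
st <- summ_stats(y, mu)
```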
I can take a look.

I have added some comments, but I'm not done with the review yet.
Besides, I think the documentation needs to be updated (at least re-roxygenized, but progressr should also be mentioned), perhaps the vignettes as well, and I haven't run `R CMD check` (including the unit tests) yet. The `NEWS` file would also need to be updated.
Thanks for all the comments.
I have run the unit tests, but not yet `R CMD check`. I had not updated all the docs, as I wanted to get approval on the behavior changes first. I'll start working on the doc changes next.
Merged master, which included the new Pareto-k diagnostics. There were several conflicts in `cv_varsel.R`, and I hope I resolved them correctly.

Re-running tests after the merge.

This is a new one:
```
Error in `FUN()`:
! In path: "./projpred/tests/testthat/setup.R"
Caused by error:
! (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
```
I get the same PIRLS error with the master branch, so it's not just this PR.
That's strange. For me, the tests succeed on `master`. Can you locate where this error occurs in `setup.R` (and for which model)?
`traceback()` says line 1423 in `setup.R`. I tried to run it line by line, but that's difficult because some variables controlling what is executed keep changing to `FALSE` (e.g. `run_vs` and `run_cvvs`), so I'm not able to reproduce what is executed when calling `devtools::test()`. I've now run out of time for today.
I have now reviewed the files `glmfun.R`, `methods.R`, `misc.R`, `summary_funs.R`, and `varsel.R`. The file `cv_varsel.R` (as well as `DESCRIPTION` and all of the test files) is still missing. Before proceeding with `cv_varsel.R`, I think I should read Magnusson et al. (2020) again because I read it long ago and have forgotten many details. Also, as mentioned above (just to keep track of this), some other things are still missing (documentation, vignettes, `R CMD check` (including unit tests), the `NEWS` file).
Sorry, I forgot to mention that in `summary_funs.R`, I have not reviewed several parts yet:

1. The `else if (stat %in% c("mse", "rmse", "R2")) {` branch, in particular the code following the comment

   ```r
   # Compute mean and variance in log scale by matching the variance of a
   # log-normal approximation
   # https://en.wikipedia.org/wiki/Log-normal_distribution#Arithmetic_moments
   ```

2. The definition of `.srs_diff_est_w()` as well as all of its occurrences.

Reviewing these parts probably also makes sense only after having read Magnusson et al. (2020). Reviewing R2 and the changes for MSE and RMSE also requires some more time, I guess.
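For reference, the Wikipedia formulas behind that comment can be sketched as follows (a standalone illustration with a hypothetical function name, not the code from `summary_funs.R`): given the arithmetic mean `m` and variance `v` of a positive statistic, the matching log-normal parameters follow in closed form.

```r
# Standalone illustration (hypothetical function, not projpred's code):
# match a log-normal distribution to arithmetic mean m and variance v,
# i.e. invert E[X] = exp(mu + sigma2/2) and
# Var[X] = (exp(sigma2) - 1) * exp(2*mu + sigma2).
lognormal_match <- function(m, v) {
  sigma2 <- log(1 + v / m^2)  # variance on the log scale
  mu <- log(m) - sigma2 / 2   # mean on the log scale
  list(mu = mu, sigma2 = sigma2)
}

# Round trip: the matched parameters reproduce the arithmetic moments.
p <- lognormal_match(m = 2, v = 0.5)
m_back <- exp(p$mu + p$sigma2 / 2)
v_back <- (exp(p$sigma2) - 1) * exp(2 * p$mu + p$sigma2)
```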
Note that there is one typo in the paper as mentioned in https://github.com/stan-dev/loo/pull/238
I trust all the new delta-method and variance-matching parts, as the results match well with the bootstrap results.
Thanks. I'll take a look at that, too.
Concerning the tests: I just pushed commit 28e011789ea2b5ad2743c7821bc88b2ca2c76a77 to `master` because after upgrading CmdStan to v2.35.0, I received an error at a place where a tolerance was set too tightly. Apart from that, the tests on branch `master` ran through on my machine. But since, in general, Stan results depend on the machine, it could be that you are encountering an error while I don't. Nevertheless, this shows how fragile the tests currently are (especially for GLMM submodels, where I think you are encountering the error; GAMs and GAMMs are probably similarly fragile). In the future, we could try to make the tests more stable. I think an important reason for the current instability is the small number of observations, the small K in K-fold CV, the small number of posterior draws, and the small number of projected draws / clusters. All of these have been chosen so small because of the high computational cost associated with higher values.
I just pushed several commits with minor tweaks concerning progressr (commits 71a2198f791fe837853a855dc07a380d4e1ceac2 to 0996e7c476ba68d940a834314f3cbf2beebe097d). The new global option `projpred.use_progressr` still needs to be documented. And if possible, we should also add tests for progressr.
I also pushed two other commits: 8488e3eeddfd376b8ba623bc761c8bed46229857 and 118838cf1b30836422ccdf3ef1f1fdb5fa6c7e68. These are the result of my attempts to reproduce the issue you reported in https://github.com/stan-dev/projpred/pull/496#discussion_r1579209073. I finally managed to reproduce it using:
```r
data("df_gaussian", package = "projpred")
df_gaussian <- df_gaussian[1:29, ]
dat <- data.frame(y = df_gaussian$y, df_gaussian$x)
rfit <- rstanarm::stan_glm(y ~ X1 + X2 + X3 + X4 + X5,
                           data = dat,
                           chains = 1,
                           iter = 500,
                           seed = 1140350788,
                           refresh = 0)
devtools::load_all(".")
# library(projpred)
library(progressr)
handlers(global = TRUE)
doFuture::registerDoFuture()
export_default <- options(doFuture.foreach.export = ".export")
ncores <- 4L
future::plan(future::multisession, workers = ncores)
stopifnot(identical(foreach::getDoParWorkers(), ncores))
cvvs_K_par <- cv_varsel(rfit, cv_method = "kfold", K = 4, seed = 5602,
                        parallel = TRUE)
```
Most importantly, the `.select <- .select` lines are not needed when the package is installed and then attached via `library(projpred)` (I guess loading should also suffice). So I guess the issue is caused by `devtools::load_all()`. Since users will never use `devtools::load_all()`, I have removed those lines now.

The `do_call()` issue mentioned in commit 8488e3eeddfd376b8ba623bc761c8bed46229857 did not go away when installing projpred instead of using `devtools::load_all()`. But I found the `.package` solution, which I think is the best solution we can come up with.
I forgot to mention that I removed the `.select <- .select` lines after having introduced the `.package` solution for the `do_call()` issue, so perhaps the `.package` solution also played a part in resolving the `.select <- .select` issue.
I'm still unable to run the tests due to the error `(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate`.
The PIRLS error comes from `setup.R` line 1424, which eventually calls `gamm4()` with

```r
projpred_formula_no_random = .y_gamm_brnll.1 ~ xco.1 + xca.1 + s(s.1)
projpred_random = ~(xco.1 | z.1)
```

I also have the specific data if that helps. I have not been able to figure out what goes wrong.
The tests run if I change `nobsv <- 41L` in `setup.R` to `nobsv <- 43L`, but then of course all the results change. Still, this shows it's a problem with the small `nobsv` and random data.
It seems I have not understood the logic of deltas and diffs in `plot.vsel()` and `summary.vsel()`: when I tried to fix `summary.vsel()`, `plot.vsel()` got messed up.
After your changes, progressr doesn't work for me. `use_progressr()` returns `TRUE` and I have `verbose = TRUE`, but I don't get any progress bar.
I removed the `handlers()` calls that you introduced into projpred in commit 71a2198f791fe837853a855dc07a380d4e1ceac2 because of the guidelines given in https://progressr.futureverse.org/articles/progressr-intro.html#two-minimal-apis---one-for-developers-and-one-for-end-users. According to those guidelines, the user should call `progressr::handlers()` him- or herself, e.g., via `handlers(global = TRUE)`. Do you have such a `handlers()` call? If yes (and it still does not work), could you provide a reprex?
> The tests run if I change `nobsv <- 41L` in `setup.R` to `nobsv <- 43L`, but then of course all the results change. But this shows it's a problem with the small `nobsv` and random data.
Yes, we definitely need to make the tests more robust, as explained in https://github.com/stan-dev/projpred/pull/496#issuecomment-2198682118. Perhaps we can talk about that when also discussing the `deltas` and `diff` thing.
Re: progressr, thanks, it works now. I read the doc yesterday and still missed that I need to call `handlers()` (and I have no memory of how I implemented it the first time).
Concerning https://github.com/stan-dev/projpred/pull/496#issuecomment-2185043711, I am now done with reviewing the missing files (`cv_varsel.R`, `DESCRIPTION`, and all of the test files, although I have not run the latter yet). Concerning https://github.com/stan-dev/projpred/pull/496#issuecomment-2185065683, I think I could finish reviewing this PR more quickly if we could move the R2 changes out into a separate PR. The same might hold for moving the "bootstrap → delta method" change (for RMSE) out into a separate PR. Another reason for moving these parts into separate PRs is that doing so makes it easier to compare the snapshots from the unit tests to their former variants for this PR.
It's not just moving the changes to a separate PR; it would also require implementing subsampling-LOO again for the old approaches. I would prefer not to implement subsampling-LOO for the old approaches just to be able to make two PRs. As adding subsampling-LOO added complexity, I wanted to simplify the code. All the tests passed after my changes, so there were no differences (within tolerance) in the results for non-subsampling-LOO summaries. I think separating these would practically mean implementing subsampling-LOO again from scratch for the code in the master branch and, after this PR had been merged, replacing that code with the code now in this PR. But if this PR will not progress otherwise, then I'll do it.
Sorry, I did not want to sound as if I would not finish reviewing this PR; I was only concerned about the time it takes.
You're right that we would have to choose an alternative approach for the RMSE, but as a temporary solution, I would have disallowed the RMSE for subsampled LOO-CV (or used the full `validate_search = FALSE` results, as for the AUC). But if you prefer to keep the delta method in this PR, I'm fine with it.
> Sorry, I did not want to sound as if I would not finish reviewing this PR, I was only concerned about the time it takes.
I'm also concerned about your time, as you are doing this in your own time and I don't want to burden you. Just say what I should do to save you enough time that you would be willing to finish helping to get this merged.

I now see that I probably should have first made cleanup PRs and only then the subsampling-LOO fix. Instead, I started correcting the faulty subsampling and made other improvements at the same time, which made the PR contain more changes.
> Just say what I should do to save you enough time that you would be willing to finish helping to get this merged.
Thanks, I'll let you know if I get stuck.
A question related to your comment here: Is it mentioned in Magnusson et al. (2020) that the variances from eq. (8) and eq. (9) need to be summed? I couldn't find such a statement at first glance, although it would make sense. By the way, in section 2.3 of the paper, point 5 says

> Estimate $\mathrm{elpd}_D$ and $V(\mathrm{elpd}_D)$ using Eq. (7) and (8).

which ignores eq. (9), but this might be a typo.
Commit a7458b98f8efbf255966c29769ac6fe08ce9253a reverts the changes with respect to the new "mixed deltas" variant of `plot.vsel()`, the new column behavior of `summary.vsel()`, and the omission of option `"best"` of argument `baseline` (the latter is now re-introduced, but disallowed for subsampled LOO-CV).

Now I still need to do points 1 and 3 from https://github.com/stan-dev/projpred/pull/496#issuecomment-2185065683 and, as mentioned in https://github.com/stan-dev/projpred/pull/496#issuecomment-2185043711, some other things are still missing (documentation, vignettes, `R CMD check` (including unit tests), the `NEWS` file), apart from the open discussions here.
Branch `mixed_deltas_plot` holds the state before reverting those things. I'll try to merge any future commits that are unrelated to those features into `mixed_deltas_plot`.
> A question related to your comment here: Is it mentioned in Magnusson et al. (2020) that the variances from eq. (8) and eq. (9) need to be summed? I couldn't find such a statement at first glance, although it would make sense.
It is not mentioned there, but it is required to get the appropriate variance for the plots. I first followed the paper, but then the variance can be really small; thinking it through, I realized that, at least in our use case, we need to add them together.
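In symbols, the combination described in this comment is simply (my restatement, not notation from the paper, assuming eq. (8) and eq. (9) estimate the two error sources discussed here):

```math
\widehat{V}_{\text{total}} = \widehat{V}_{\text{eq.\,(8)}} + \widehat{V}_{\text{eq.\,(9)}}
```

i.e., the variance of the elpd estimate itself and the variance induced by subsampling are treated as additive error sources for the plotted uncertainty.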
By the way, I had kept `loo_inds` for future use. Eventually it would be useful to be able to run subsampling-LOO first with e.g. `nloo = 50`, and then add another 50 observations if the variances are too big.
Makes sense. Did I break something with respect to `loo_inds`? If yes, that was not intentional.
It's not used, but I thought it's better to keep it there until incremental subsampling is supported, and then decide whether to keep it, depending on how incremental subsampling is implemented.
> It's not used

Yes, it wasn't, but in commit 19948e33b3a2c4ce6c90afc4552b8004078076e6, I changed that (note that there is also a fixup commit: 519dac268d130a593713f75ed6f5370b25392be0). I think using `loo_inds` in `get_stat()` is safer than relying on `NA`s (remember how hard a time we had finding out under which circumstances `NA`s can occur).
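A toy illustration of that point (hypothetical objects, not `get_stat()` itself): with subsampled LOO, selecting the evaluated observations via an explicit index vector like `loo_inds` stays unambiguous even if `NA`s could also appear for other reasons.

```r
# Toy illustration (hypothetical, not projpred's get_stat()): per-observation
# summaries where only a subsample of observations was evaluated.
lppd <- c(-1.2, NA, -0.8, NA, -1.5)  # NA = not evaluated under subsampling
loo_inds <- c(1L, 3L, 5L)            # indices that were actually evaluated

# Explicit indices: unambiguous about which observations to use.
mean_via_inds <- mean(lppd[loo_inds])

# NA-based filtering: gives the same number here, but silently changes
# meaning if an NA can also arise from a genuine failure elsewhere.
mean_via_na <- mean(lppd, na.rm = TRUE)
```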
When I changed several bootstraps to analytic approximations and improved other approximations, I thought the math I was writing in the code was so trivial that I didn't write down all the derivations and assumptions separately. Now I see I should have done that, as it takes even me a bit of time to re-check any of these when you ask a question, so they are not as obvious as I thought. If you like, I can some day write up the equations and assumptions for easier checking. Until then: based on the tests, every approximation I wrote is at least as accurate as the earlier bootstrap, but much faster.
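As one concrete example of the kind of derivation meant here (my own sketch of a delta-method step for the RMSE; the exact form in the code may differ): for g(x) = sqrt(x), the delta method gives Var(sqrt(MSE)) ≈ Var(MSE) / (4 · MSE).

```r
# Hedged sketch (assumed form, not necessarily the code's exact version):
# delta-method standard error for the RMSE from the MSE estimate and its
# variance, using g(x) = sqrt(x), g'(x) = 1 / (2 * sqrt(x)).
rmse_delta <- function(mse_hat, var_mse) {
  list(rmse = sqrt(mse_hat),
       se_rmse = sqrt(var_mse / (4 * mse_hat)))
}

# Example with simulated squared errors:
set.seed(1)
err2 <- rchisq(1e5, df = 1)
mse_hat <- mean(err2)
var_mse <- var(err2) / length(err2)
res <- rmse_delta(mse_hat, var_mse)
```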
> The tests run if I change `nobsv <- 41L` in `setup.R` to `nobsv <- 43L`, but then of course all the results change. But this shows it's a problem with the small `nobsv` and random data.
For me, `nobsv <- 43L` does not work (it runs into some error, similarly to how `nobsv <- 41L` did for you). However, `nobsv <- 39L` works for me. Does it work for you as well (on `master`)? Then I would pick that for the time being. Ultimately, it would be desirable to completely revise the tests because currently, we mainly test the "big" user-level functions, with the tests for all the "small" internal functions being quite short or not existing at all. The principle of testing should rather be to test the underlying functions extensively, because then it is easier to keep the tests for the "big" (and hence slow) user-level functions short.
With `nobsv <- 39L` I get `[ FAIL 0 | WARN 902 | SKIP 2 | PASS 60545 ]`.
That sounds good. The warnings probably arise from the first creation of the snapshots. If you are running the tests via `R CMD check`, then from the second run on, you can avoid these warnings by (locally) removing the entry `^tests/testthat/bfits$` from the `.Rbuildignore` file. For the two skips, I don't know where they are coming from, but they are probably due to a suggested package that is not installed.

Since this solution seems to be working for you, I will push a commit to `master` (and merge it here) that changes `nobsv` to `39`. As mentioned above, this is only a quick workaround.
While working on the tests, I discovered a few bugs and fixed them. However (apart from what is still open in the comments above), we still need to think about the following:

- In `cv_varsel.vsel()`, is it correct to treat `summaries_fast` like the search-related arguments? Or do we need to set `summaries_fast` to `NULL` if not usable, just like `rk_foldwise`?
- In general (at least without a `family$latent_ilink()` function), we can expect no `NA`s in reference and standard submodel summaries (`mu` and `lppd`). In case of subsampled LOO, I think we can expect no `NA`s in reference and fast submodel summaries. So in such cases, should (unexpected) `NA`s in the corresponding summaries be left as-is so that downstream code "sees" them? What I have in mind is, for example, whether we can remove the `na.rm = TRUE` part from https://github.com/stan-dev/projpred/blob/7469aeace5f9795d4c2f6ff608913e807c70a8f0/R/summary_funs.R#L305-L306 so that the default of `na.rm = FALSE` is used.
- For the tests, I think it would make sense to add subsampled-LOO `cv_varsel()` calls already in `setup.R` so that all post-processing functions are tested for such subsampled-LOO `cv_varsel()` output. When doing so, we should pay attention to cover augmented-data and latent projection cases.
- Finally, while experimenting with the tests, I discovered some strange `NA`s in the `summaries_sub` (of `.tabulate_stats()`) after running https://github.com/stan-dev/projpred/blob/7469aeace5f9795d4c2f6ff608913e807c70a8f0/tests/testthat/test_varsel.R#L1452-L1456 (slightly adapted, namely using `excl_nonargs(args_cvvs_i, nms_excl_add = "validate_search")`) on `tstsetup <- "rstanarm.glm.cumul.stdformul.without_wobs.with_offs.latent.default_meth.default_cvmeth.default_search_trms"`. This needs to be investigated.
Work in progress.

Tested with

Notes:

- `cv_varsel`: if `nloo < n` and the fast PSIS-LOO result is not yet available, the fast PSIS-LOO result is computed
- `cv_varsel`: if `nloo < n`, the fast PSIS-LOO result is stored in slot `$summaries_fast`

Next: tagging @n-kall
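The two notes about `cv_varsel` can be summarized in a schematic control flow (a deliberately simplified sketch, not the actual implementation):

```r
# Simplified schematic of the nloo < n behavior described in the notes
# (hypothetical structure, not projpred's actual cv_varsel() code):
run_subsampled_loo <- function(object, nloo, n) {
  if (nloo < n) {
    if (is.null(object$summaries_fast)) {
      # fast PSIS-LOO result not yet available: compute it (placeholder here)
      object$summaries_fast <- list(computed = TRUE)
    }
    # the fast PSIS-LOO result now lives in slot $summaries_fast for reuse
  }
  object
}

obj <- run_subsampled_loo(list(summaries_fast = NULL), nloo = 50, n = 100)
```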