cross-validation and reproducibility

Ali777927 commented 3 years ago

Dear Prof. Breheny,

First of all, I would like to express my gratitude for all your efforts and scientific contributions in the field of regularization and penalized regressions, we are so grateful for all of your efforts.

I am performing a linear MCP regression with around 2000 observations and around 400 potential predictors. Although I added the option “seed”, I still have troubles in getting reproducible results (and I am not able to get my code run with filling the option “fold”). In this regard I have some specific questions:

What would be a perfect “nfold” number for my dataset? What is the maximum I should try? Is increasing the “nfold” number to up to 100 or even 300 would cause “bias”?
Is repeating the analysis multiple times and then choosing the average lambda of the multiple analyses a reasonable option ? (I read this solution in one of your presentations if I understand it correctly)
To choose the best lambda, we suggested the following criteria: we increased the “nfold” number till the lambda gets stabilized and we get the same results three times in a row. Yet, it is rarely that I get the same results three times in a row (even after using “seed”). Do you agree with me that this is not a good approach?

Hopefully, you can help me to get an appropriate solution for my problem. Thanks in advance.

pbreheny commented 3 years ago

In terms of code,

I am unable to replicate your result in the sense that I always get the exact same results with the same seed:

library(ncvreg)
data(Prostate)
cvfit1 <- cv.ncvreg(Prostate$X, Prostate$y, seed=1)
cvfit2 <- cv.ncvreg(Prostate$X, Prostate$y, seed=1)
cvfit3 <- cv.ncvreg(Prostate$X, Prostate$y, seed=1)
identical(cvfit1$fold, cvfit2$fold)
identical(cvfit1$fold, cvfit3$fold)

What version are you running?

"I am not able to get my code run with filling the option 'fold'"...why not? What error do you see?

In terms of statistical practice:

The maximum number of folds is n, the number of observations. This is known as leave-one-out cross-validation. It is an attractive approach, as you always get the same result (there is only one way of assigning folds, so it is not random). The main reason people don't do it is that one would have to fit the model 2000 times (or whatever n is).
Repeated cross validation is certainly something that people do, but I would say that if you're willing to fit, say, 100-fold cross validation 10 times, then why not just do leave-one-out cross validation?
Fundamentally, the issue is that there is uncertainty about lambda, and a range of values that are consistent with the data. There are many approaches that are reasonable here and result in picking a lambda value that works well. Presumably something like model averaging that accounts for uncertainty in the lambda values would be better, but is more complicated.

Ali777927 commented 3 years ago

Dear Prof. Breheny,

Thank you very much for your detailed answers, I appreciate that a lot.

Regarding the code:

I think I know now why my results were not reproducible, I was using the same seed, but different nfolds. It is then very logical to get different results (like the code below). I am sorry that I didn't make that clear in my post (I had the idea that when I would use the same seed, it should give me the same results). library(ncvreg) data(Prostate) cvfit1 <- cv.ncvreg(Prostate$X, Prostate$y, seed=1, nfolds = 12) cvfit2 <- cv.ncvreg(Prostate$X, Prostate$y, seed=1, nfold = 23) identical(cvfit1$fold, cvfit2$fold)

2.Regarding the point of "fold", I am using 3.12.0 version. I get the message "R session aborted" and R stops. I believe that I not writing the code well, I don't fully understand the sentence " Which fold each observation belongs to." and how to adjust the code in that regard. My apologies, I have around one year experience in R with medical background, and I didn't persist to get that done correctly yet (and that is not necessary now, the default option is already nice). I get the aforementioned message when writing the code like this:

cvfit1 <- cv.ncvreg(Prostate$X, Prostate$y, seed=1, fold =3, nfolds = 12)

Regarding the points about nfold, I really appreciate your tips and comments, thank you so much. I finally got the same exact results by using LOOCV. Actually, I read previously about LOOCV but I was hesitant to use it because of the different comments I read on the web about its effects on the variance (they said it increases the variance), although this is debatable (and might not per se apply to the settings where we do CV for tuning parameters).

Best wishes (and Happy new year !), Ali

pbreheny commented 3 years ago

The argument fold should be a vector of length n that describes which fold each observation is assigned to. For example, if n=9:

fold = c(1,1,1,2,2,2,3,3,3)

would assign the first three observations to the first fold, and so on. Also, you can look at fit$fold to see how folds were assigned and the argument should look like.

As far as the supposed increase in variance when using LOOCV, yes, this is often claimed, but I have never seen convincing evidence that this is true in general. At best, the statement is an oversimplification.

pbreheny / ncvreg

cross-validation and reproducibility #23