tagteam / riskRegression

R package for risk regression and prediction with censored data

Repeated cross-validation does not decrease standard error #59

Closed darentsai closed 3 months ago

darentsai commented 3 months ago

Dear maintainers,

I'm using this excellent package for model assessment. The following example demonstrates my issue.

library(riskRegression)

set.seed(10)
learndat = sampleData(400,outcome = "binary")
lr1a = glm(Y ~ X6,data = learndat, family = binomial)
lr2a = glm(Y ~ X7 + X8 + X9, data = learndat, family = binomial)

I use 5-fold CV with AUC as the performance metric, and the standard errors for both models are around 0.0258. When I repeat it 100 times, the standard errors and confidence intervals are almost the same. I thought repeated CV would make the estimator more stable, but the standard error does not decrease noticeably.

sc1 = Score(list("LR1" = lr1a, "LR2" = lr2a), formula = Y ~ 1, data = learndat,
            split.method = "cv5", B = 1)
sc1$AUC$score
#     model       AUC         se     lower     upper
#    <fctr>     <num>      <num>     <num>     <num>
# 1:    LR1 0.7085140 0.02582834 0.6578913 0.7591366
# 2:    LR2 0.7179151 0.02588211 0.6671871 0.7686431
sc100 = Score(list("LR1" = lr1a, "LR2" = lr2a), formula = Y ~ 1, data = learndat,
              split.method = "cv5", B = 100)
sc100$AUC$score
#         model       AUC         se    lower     upper
#        <fctr>     <num>      <num>    <num>     <num>
# 1: Null model 0.5000000 0.00000000 0.500000 0.5000000
# 2:        LR1 0.7062342 0.02569907 0.655865 0.7566035
# 3:        LR2 0.7208975 0.02542573 0.671064 0.7707310
tagteam commented 3 months ago

Well, the magnitude of the standard error depends on (a) the statistical uncertainty of your prediction model algorithm (here a simple pre-specified glm) in the current training data and (b) the uncertainty of the estimate of the AUC in the level-one data. Neither of these depends on the number of repetitions. Repeating cv-5 decreases the random seed dependence of the result but does not affect the standard error.
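
For illustration (just a quick sketch, with arbitrarily chosen seeds, using the same data and model as in your example): rerunning both versions under two different seeds should show the single cv-5 point estimate moving with the split, the repeated cv-5 point estimate barely moving, and the reported se staying around 0.026 in all four runs.

library(riskRegression)

# same setup as in the issue
set.seed(10)
learndat = sampleData(400, outcome = "binary")
lr1a = glm(Y ~ X6, data = learndat, family = binomial)

# single cv-5 under two different seeds: the point estimate depends on the split
set.seed(1)
s1 = Score(list("LR1" = lr1a), formula = Y ~ 1, data = learndat,
           split.method = "cv5", B = 1)
set.seed(2)
s2 = Score(list("LR1" = lr1a), formula = Y ~ 1, data = learndat,
           split.method = "cv5", B = 1)

# repeated cv-5 under two different seeds: the point estimate is almost
# seed-independent, but the reported se is of the same magnitude as with B = 1
set.seed(1)
r1 = Score(list("LR1" = lr1a), formula = Y ~ 1, data = learndat,
           split.method = "cv5", B = 100)
set.seed(2)
r2 = Score(list("LR1" = lr1a), formula = Y ~ 1, data = learndat,
           split.method = "cv5", B = 100)

rbind(s1$AUC$score, s2$AUC$score, r1$AUC$score, r2$AUC$score)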

T


darentsai commented 3 months ago

Thanks for your feedback! As far as I know, repeating the k-fold cross-validation reduces the variability of the performance estimate, leading to a more stable and reliable assessment.

For 5-fold CV, the final AUC is just the average of the 5 fold-wise AUCs.

$$ \hat{AUC} = \frac{1}{5} \sum_{i=1}^{5} AUC_{fold.i} $$

But for 5-fold CV repeated 100 times, the final AUC is based on 100 × 5 = 500 fold-wise AUCs.

$$ \hat{AUC}^* = \frac{1}{500} \sum_{j=1}^{100} \sum_{i=1}^{5} AUC_{fold.i}^{rep.j} $$

So I would expect $Var(\hat{AUC}^*) \le Var(\hat{AUC})$. Isn't that the case?


If what you measure is just $Var(AUC_{fold.i})$ and $Var(AUC_{fold.i}^{rep.j})$, then, as you said, neither depends on the number of repetitions, and hence they are equal.

tagteam commented 3 months ago

No, I don't think so, and your code example does not show it either. Note that when you repeat cv-5 two times, the estimator implemented in riskRegression simply averages the two cv-5 estimates, which differ only in the way the data are split into 5 pieces. By repeating cv-5 many times you decrease the dependence on how the data are split into 5 pieces, but not the variance of the estimator.
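
One way to make this explicit (a back-of-the-envelope sketch, assuming for simplicity that the repeated cv-5 estimates $\hat{\theta}_1, \dots, \hat{\theta}_R$ computed on the same data have common variance $\sigma^2$ and common pairwise correlation $\rho$, which is high because they reuse the same observations):

$$ Var\left(\frac{1}{R}\sum_{j=1}^{R}\hat{\theta}_j\right) = \sigma^2\left(\rho + \frac{1-\rho}{R}\right) \longrightarrow \rho\,\sigma^2 \quad \text{as } R \to \infty. $$

Only the $(1-\rho)$ part of the variance, the part due to the random choice of splits, is averaged away; the part driven by the data and by training the models remains.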

To see what is going on more formally, you need to make the random variables in your formula for the estimators explicit, so that you can calculate and compare the variances. It may help to think about the situation in the following way: you have a sample of size n and you really want to estimate the performance of the model that your modelling algorithm produces when trained on the full sample. This is not possible, because there would be no independent validation data. The best we can do is to cross-validate: we estimate the average performance of the modelling algorithm when repeatedly trained on a smaller sample size. In the cv-5 case the training size is roughly n - n/5.

If we had all the computing power in the world, we would run through all possible subsets of size n - n/5. But this would only remove the variability that comes from taking a random subset of all possible subsets of size n - n/5, i.e., from the way we split the data into 5 pieces. It would certainly not systematically affect the variability coming from the random variables that enter the AUC, nor the variability of the algorithm used to train the models. The latter two dominate the variance of the estimator implemented in riskRegression.

Proposed notation:

- $n$: sample size
- $m = n - n/5$: subsample size
- $D_n = (Y_i, X_i)_{i=1}^n$: full sample
- $D_m^b$: $b$-th subsample
- $V_m^b = D_n \setminus D_m^b$: out-of-bag data not in the subsample
- $r_m^b$: prediction model obtained when the algorithm is trained on $D_m^b$
- $r_m^b(X_i)$: predicted risk from that model at $X_i$

The AUC, being a rank statistic (U-statistic), is hard to write out (in an email), but you could look at the Brier score (quadratic loss) to begin with.
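
For concreteness, in this notation a cross-validated Brier score estimator could be sketched as

$$ \widehat{\mathrm{Brier}} = \frac{1}{K} \sum_{b=1}^{K} \frac{1}{|V_m^b|} \sum_{i \in V_m^b} \left(Y_i - r_m^b(X_i)\right)^2, $$

where $K$ is the total number of subsamples (e.g., $K = 5 \times 100$ for cv-5 repeated 100 times). Increasing $K$ averages over more splits, but the randomness of the $(Y_i, X_i)$ and of the trained models $r_m^b$ enters every term and is not averaged away.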


darentsai commented 3 months ago

Very grateful for your detailed explanation! I think I misunderstood which type of variance this package measures.

> Note that when you repeat cv-5 two times, the estimator implemented in riskRegression simply averages the two cv-5 estimates, which differ only in the way the data are split into 5 pieces.

If I do a 10-repetition cv-5 and get $AUC_i\ (i = 1, \dots, 10)$, the final estimate is $\bar{AUC} = \frac{1}{10} \sum_{i=1}^{10} AUC_i$.

The variance you are actually measuring is $Var(AUC_i)$, but I mistakenly thought what you measure is $Var(\bar{AUC})$.

If I'm concerned with the latter, I think its magnitude will depend on the number of repetitions (right?)
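
As a rough empirical check (just a sketch, with an arbitrary number of seeds and the example data from above), one could rerun the single-repetition estimator under many seeds and compare the spread of those point estimates with the spread of averages of 10 of them; the latter should shrink roughly like $1/\sqrt{10}$, while the se reported by Score stays around 0.026.

library(riskRegression)

set.seed(10)
learndat = sampleData(400, outcome = "binary")
lr1a = glm(Y ~ X6, data = learndat, family = binomial)

# point estimate of a single cv-5 run for a given seed (illustration only)
cv5_auc = function(seed) {
  set.seed(seed)
  sc = Score(list("LR1" = lr1a), formula = Y ~ 1, data = learndat,
             split.method = "cv5", B = 1)
  with(sc$AUC$score, AUC[model == "LR1"])
}

single = sapply(1:50, cv5_auc)            # 50 single-repetition estimates
averaged = colMeans(matrix(single, 10))   # 5 averages of 10 repetitions each

sd(single)    # split-to-split variability of a single cv-5 run
sd(averaged)  # roughly sd(single)/sqrt(10); the se reported by Score does not shrink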