Using validate with default arguments (10-fold CV) on smaller samples (roughly N < 75), or using functions that call validate (e.g., beset_elnet) under the same circumstances, occasionally results in an error that traces back to the utility function create_folds. This function applies stratified random sampling to assign observations to folds, trying to keep N equal across folds while also giving each fold a similar distribution of the outcome variable. As sample size drops below 75 (so that each fold holds out fewer than 7.5 cases on average), equalizing distributions becomes impossible, but the current algorithm continues to attempt stratification, which occasionally fails to assign any holdout cases to one or more of the requested folds. For example, with 10-fold cross-validation, a test simulation produced this error at the following frequencies:
With N = 75, occurs 0 times out of every 10,000 random seeds.
With N = 70, occurs 6 times out of every 10,000 random seeds (0.06% of the time).
With N = 60, occurs 21 times out of every 10,000 random seeds (0.21% of the time).
With N = 50, occurs 236 times out of every 10,000 random seeds (2.36% of the time).
Arguably, k-fold cross-validation is not the optimal validation strategy for samples this small (as it approaches the high variance of leave-one-out cross-validation), but even so, fold assignment should never fail to place at least one individual in each holdout fold.
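
One possible way to guarantee non-empty folds is sketched below. This is not the current create_folds implementation, and it is written in Python rather than the package's own language purely for illustration. The function name assign_folds and its arguments are hypothetical. The idea, under those assumptions, is to sort observations by the outcome, cut the sorted order into consecutive blocks of k, and give each block a random permutation of the fold labels. Whenever N >= k, the first block is full, so every fold receives at least one case, and fold sizes differ by at most one.

```python
import random


def assign_folds(y, k=10, seed=None):
    """Hypothetical sketch: return one fold label (0..k-1) per observation.

    Observations are sorted by outcome and dealt out in blocks of k, so each
    fold samples from across the outcome distribution and no fold is empty.
    """
    n = len(y)
    if n < k:
        raise ValueError("need at least k observations to fill k folds")
    rng = random.Random(seed)
    # Indices of the observations, ordered by outcome value.
    order = sorted(range(n), key=lambda i: y[i])
    folds = [None] * n
    for start in range(0, n, k):
        block = order[start:start + k]
        # Each block receives a fresh random permutation of the fold labels;
        # a short final block simply uses the first few labels of its permutation.
        labels = list(range(k))
        rng.shuffle(labels)
        for i, label in zip(block, labels):
            folds[i] = label
    return folds
```

For example, assign_folds(outcome, k=10, seed=1) on a sample of 50 cases would return 50 labels with exactly 5 cases per fold. The trade-off is that stratification is only approximate, which seems acceptable at sample sizes where exact balancing is not achievable anyway.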