jashu / beset

Best Subset Predictive Modeling

`create_folds` can return empty folds when sample size is small (N / fold is in single digits) #2

Closed by jashu 4 years ago

jashu commented 4 years ago

Using `validate` with default arguments (10-fold CV) on smaller samples (roughly N < 75), or using functions that call `validate` (e.g., `beset_elnet`) under the same circumstances, occasionally results in an error that traces back to the utility function `create_folds`. This function applies stratified random sampling to assign observations to folds, aiming for an equal number of observations per fold while also giving each fold a similar distribution of the outcome variable. As sample size drops below 75 (so that each fold holds fewer than 7.5 holdout cases on average), equalizing the distributions becomes impossible, but the current algorithm still attempts stratification. The result is an infrequent, random failure to assign any holdout cases to one or more of the requested folds. For example, with 10-fold cross-validation, this error was found to occur with the following frequency in a test simulation:

Arguably, k-fold cross-validation is not the optimal validation strategy for samples this small (its variance approaches the high variance of leave-one-out cross-validation), but `create_folds` should still assign at least one observation to every holdout fold.
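
To make the requirement concrete, here is a minimal sketch of one way to guarantee non-empty folds while still stratifying roughly on the outcome. It is written in Python for illustration only and is not the code used in beset; the function name `assign_folds` and its parameters are hypothetical. The idea is to sort observations by outcome, walk through them in blocks of k, and give each observation in a block a different, randomly ordered fold label.

```python
import random


def assign_folds(y, k=10, seed=None):
    """Return one fold index (0..k-1) per observation in y.

    Hypothetical helper, not part of beset. Fold sizes differ by at most
    one, so no fold is empty as long as len(y) >= k.
    """
    n = len(y)
    if n < k:
        raise ValueError("need at least as many observations as folds")
    rng = random.Random(seed)
    # Rank observations by outcome so each block of k holds similar values.
    order = sorted(range(n), key=lambda i: y[i])
    folds = [None] * n
    for start in range(0, n, k):
        block = order[start:start + k]
        # Shuffle the fold labels, then hand out distinct labels in this block.
        labels = list(range(k))
        rng.shuffle(labels)
        for i, label in zip(block, labels):
            folds[i] = label
    return folds


if __name__ == "__main__":
    outcome = [random.Random(1).gauss(0, 1) for _ in range(23)]
    fold_ids = assign_folds(outcome, k=10, seed=42)
    sizes = sorted(fold_ids.count(f) for f in range(10))
    print(sizes)  # 23 observations over 10 folds: seven folds of 2, three of 3
```

Because every complete block contributes exactly one observation to each fold, stratification degrades gracefully as N shrinks instead of leaving a fold empty.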