mlr3resampling provides new cross-validation algorithms for the mlr3 framework in R
| [[file:tests/testthat][tests]] | [[https://github.com/tdhock/mlr3resampling/actions][https://github.com/tdhock/mlr3resampling/workflows/R-CMD-check/badge.svg]] | | [[https://github.com/jimhester/covr][coverage]] | [[https://app.codecov.io/gh/tdhock/mlr3resampling?branch=main][https://codecov.io/gh/tdhock/mlr3resampling/branch/main/graph/badge.svg]] |
** Installation
install.packages("mlr3resampling")#release version from CRAN
install.packages("remotes") remotes::install_github("tdhock/mlr3resampling")
** Description
For an overview of functionality, [[https://tdhock.github.io/blog/2024/cv-all-same-new/][please read my recent blog post]], the [[https://arxiv.org/abs/2410.08643][SOAK arXiv paper]], and [[https://github.com/tdhock/mlr3resampling/wiki/Articles][other articles]].
*** SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets
See examples in [[https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/Newer_resamplers.html][Newer resamplers vignette]] and data viz for [[https://tdhock.github.io/2023-12-13-train-predict-subsets-regression/][regression]] and [[https://tdhock.github.io/2023-12-13-train-predict-subsets-classification/][classification]].
A supervised learning algorithm inputs a train set, and outputs a prediction function, which can be used on a test set. If each data point belongs to a subset (such as geographic region, year, etc), then how do we know if subsets are similar enough so that we can get accurate predictions on one subset, after training on Other subsets? And how do we know if training on All subsets would improve prediction accuracy, relative to training on the Same subset? SOAK, Same/Other/All K-fold cross-validation, can be used to answer these question, by fixing a test subset, training models on Same/Other/All subsets, and then comparing test error rates (Same versus Other and Same versus All).
This is implemented in =ResamplingSameOtherSizesCV= when you use it on a task that defines the =subset= role, for example the Arizona trees data, for which each row is a pixel in an image, and we want to do binary classification -- does the pixel contain a tree or not?
data(AZtrees,package="mlr3resampling") table(AZtrees$region3)
NE NW S 1464 1563 2929
We see in the output above that the =region3= column has three values (NE, NW, S). Each represents the region/area in which the pixel was found. If we want good predictions in the south (S), can we train on the north? (NE+NW) We can use the code below to setup the CV experiment. The rows 12,15,18 below represent splits that attempt to answer that question (test.subset=S, train.subsets=other).
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new() task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y") task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE) task.obj$col_roles$stratum <- "y" #keep data proportional when splitting. task.obj$col_roles$group <- "polygon" #keep data together when splitting. task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s). same_other_sizes_cv$instantiate(task.obj) same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)] test.subset train.subsets test.fold
1: NE all 1 2: NW all 1 3: S all 1 4: NE all 2 5: NW all 2 6: S all 2 7: NE all 3 8: NW all 3 9: S all 3 10: NE other 1 11: NW other 1 12: S other 1 13: NE other 2 14: NW other 2 15: S other 2 16: NE other 3 17: NW other 3 18: S other 3 19: NE same 1 20: NW same 1 21: S same 1 22: NE same 2 23: NW same 2 24: S same 2 25: NE same 3 26: NW same 3 27: S same 3 test.subset train.subsets test.fold #+end_src
The rows in the output above represent different kinds of splits:
Code to re-run:
data(AZtrees,package="mlr3resampling") table(AZtrees$region3) same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new() task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y") task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE) task.obj$col_roles$stratum <- "y" #keep data proportional when splitting. task.obj$col_roles$group <- "polygon" #keep data together when splitting. task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s). same_other_sizes_cv$instantiate(task.obj) same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]
*** Algorithm 2: cross-validation for comparing different sized train sets
See examples in [[https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/Newer_resamplers.html][Newer Resamplers vignette]] and data viz for [[https://tdhock.github.io/2023-12-26-train-sizes-regression/][regression]] and [[https://tdhock.github.io/2023-12-27-train-sizes-classification/][classification]].
How many train samples are required to get accurate predictions on a test set? Cross-validation can be used to answer this question, with variable size train sets. For example consider the Arizona Trees data below,
dim(AZtrees) [1] 5956 25 length(unique(AZtrees$polygon)) [1] 189
+end_src
The output above indicates we have 5956 rows and 189 polygons. We can do cross-validation on either polygons (if task has =group= role) or rows (if no =group= role set). The code below sets a down-sampling =ratio= of 0.8, and four =sizes= of down-sampled train sets.
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new() same_other_sizes_cv$param_set$values$sizes <- 4 same_other_sizes_cv$param_set$values$ratio <- 0.8 task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y") task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE) task.obj$col_roles$stratum <- "y" #keep data proportional when splitting. task.obj$col_roles$group <- "polygon" #keep data together when splitting. same_other_sizes_cv$instantiate(task.obj) same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)] n.train.groups test.fold
1: 51 1 2: 64 1 3: 80 1 4: 100 1 5: 126 1 6: 51 2 7: 64 2 8: 80 2 9: 100 2 10: 126 2 11: 51 3 12: 64 3 13: 80 3 14: 100 3 15: 126 3 #+end_src
The output above has one row per train/test split that will be computed in the cross-validation experiment. The full train set size is 126 polygons, and there are four smaller train set sizes (each a factor of 0.8 smaller). Each train set size will be computed for each fold ID from 1 to 3.
Code to re-run:
data(AZtrees,package="mlr3resampling") dim(AZtrees) length(unique(AZtrees$polygon)) same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new() same_other_sizes_cv$param_set$values$sizes <- 4 same_other_sizes_cv$param_set$values$ratio <- 0.8 task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y") task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE) task.obj$col_roles$stratum <- "y" #keep data proportional when splitting. task.obj$col_roles$group <- "polygon" #keep data together when splitting. same_other_sizes_cv$instantiate(task.obj) same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]
** Related work