tidymodels / rsample

Classes and functions to create and summarize resampling objects
https://rsample.tidymodels.org
Other
341 stars 67 forks source link

Make default for `v` consistent between `vfold_cv()` and `group_vfold_cv()` #328

Open juliasilge opened 2 years ago

juliasilge commented 2 years ago

There is a lot of nice consistency between the new grouped variants implemented in #313, #315, and #316, but the original, older grouped resampling function group_vfold_cv() doesn't have the same default as its sister function vfold_cv(). With the default of v = NULL, it currently returns what is also called "leave-group-out" CV.

As @mattwarkentin correctly points out in #324:

Over time this may be confusing as users may grow to expect the group_* version of something to return a sampling pattern similar to the default for its non-grouped sibling.

Unfortunately, making these defaults consistent would be a breaking change, and also unfortunately it's the kind of breaking change we don't have good options for to communicate with users about. On the other hand, we know that group_vfold_cv() isn't a heavily used function in rsample.

🎯 What is the best option to pursue in this situation?

mattwarkentin commented 2 years ago

Related to the consistency of arguments, should the newly-added repeats argument for group_vfold_cv() come after v in the arg list (i.e., before balance), as is the case for its non-grouped sibling vfold_cv()?

It borks the positional arg matching with the current implementation:

library(rsample)

vfold_cv(mtcars, 5, 2)
#> #  5-fold cross-validation repeated 2 times 
#> # A tibble: 10 × 3
#>    splits         id      id2  
#>    <list>         <chr>   <chr>
#>  1 <split [25/7]> Repeat1 Fold1
#>  2 <split [25/7]> Repeat1 Fold2
#>  3 <split [26/6]> Repeat1 Fold3
#>  4 <split [26/6]> Repeat1 Fold4
#>  5 <split [26/6]> Repeat1 Fold5
#>  6 <split [25/7]> Repeat2 Fold1
#>  7 <split [25/7]> Repeat2 Fold2
#>  8 <split [26/6]> Repeat2 Fold3
#>  9 <split [26/6]> Repeat2 Fold4
#> 10 <split [26/6]> Repeat2 Fold5

group_vfold_cv(mtcars, cyl, 5, 2)
#> Error in `group_vfold_cv()`:
#> ! `balance` must be a character vector, not a number.
mikemahoney218 commented 2 years ago

Related to the consistency of arguments, should the newly-added repeats argument for group_vfold_cv() come after v in the arg list (i.e., before balance), as is the case for its non-grouped sibling vfold_cv()?

I opened #334 for this. That makes a lot of sense to me (and balance hasn't hit CRAN yet so I feel fine just doing that)