tidymodels / rsample

Classes and functions to create and summarize resampling objects
https://rsample.tidymodels.org

more flexibility with stratification / grouped sampling #211

Closed ColinConwell closed 3 years ago

ColinConwell commented 3 years ago

Thank you for all your hard work on rsample. I know this has been a topic of some debate in the past (and I do see a number of closed and open issues pertaining to this), but I've been finding the combination of single-variable stratification and the pooling threshold in rsample to be noxiously limiting. Working in an empirical domain where we often have condition-rich designs but small samples (e.g. neuroimaging), it's imperative that we be able to perform stratified resampling with a bit more flexibility across multiple groups.

I'm consistently running into problems that result in "Warning message: Too little data to stratify", despite knowing exactly how much data I expect to be in each stratum and being willing to accept the limitations thereof.

Since the deprecation of broom's bootstrap (which allowed resampling on a grouped tibble), rsample is increasingly the main package that facilitates these operations. I definitely empathize with many of the issues that result from giving the user more freedom to specify their stratification strategy, but the alternative means I'm effectively having to reinvent the wheel to get the flexibility I need, which seems counterproductive.

Perhaps just a series of very robust warnings would be sufficient to wash your hands of the issues that result from users abusing this flexibility? My thanks in advance for your consideration. I appreciate it!

juliasilge commented 3 years ago

Thanks so much for this feedback @ColinConwell. As we look back at how this feature has been used by folks with various constraints, we are now considering how to expose the pooling threshold argument to users so it could be changed in some situations.

We'll want to generate warnings based on our opinionated take of what is "too low" and include documentation to indicate that lowering the argument (currently pool in make_strata()) may result in... 💣 ☢️ ☠️
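To make the pooling idea concrete, here is a minimal base R sketch of what a threshold like pool does conceptually: levels whose share of the rows falls below the threshold are lumped into a single combined stratum. The helper pool_rare_levels() and the toy condition vector are hypothetical and written only for illustration; this is not rsample's actual make_strata() code.

```r
# Hypothetical helper, NOT part of rsample: lump levels whose share of the
# data is below `pool` into one combined stratum called "pooled".
pool_rare_levels <- function(x, pool = 0.1) {
  x <- as.character(x)
  shares <- table(x) / length(x)          # share of rows held by each level
  rare <- names(shares)[shares < pool]    # levels under the threshold
  x[x %in% rare] <- "pooled"
  factor(x)
}

# Toy data: two large conditions and two small ones (5% and 3% of rows).
condition <- c(rep("a", 50), rep("b", 42), rep("c", 5), rep("d", 3))

pooled <- pool_rare_levels(condition, pool = 0.1)
table(pooled)  # a: 50, b: 42, pooled: 8
```

Lowering the threshold in this sketch (for example pool = 0.04) would keep "c" as its own stratum, which is the kind of control being asked for here, at the cost of sampling within a very small group.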

ColinConwell commented 3 years ago

Totally understandable! I think a strong warning and a conservative default in this case would be sufficient to buttress the opinion, but the flexibility, I think, is definitely key as well. If the user persists past what is reasonable or pragmatic, it's not then a fault of the software. The ability to set pool should also, I reckon, cover the vast majority of use cases I was considering. Thanks again for the reply.

juliasilge commented 3 years ago

Thanks for your patience @ColinConwell! 🙌 You can now get this feature by installing from GitHub:

devtools::install_github("tidymodels/rsample")

It is now implemented for the main user-facing resampling functions such as vfold_cv(), mc_cv(), and friends:

library(tidyverse)
library(rsample)

df <- tibble(x = rnorm(60), label = rep(letters[1:12], each = 5))

mc_cv(df, strata = label)
#> Warning: Too little data to stratify. Unstratified resampling
#> # Monte Carlo cross-validation (0.75/0.25) with 25 resamples  using stratification 
#> # A tibble: 25 x 2
#>    splits          id        
#>    <list>          <chr>     
#>  1 <split [45/15]> Resample01
#>  2 <split [45/15]> Resample02
#>  3 <split [45/15]> Resample03
#>  4 <split [45/15]> Resample04
#>  5 <split [45/15]> Resample05
#>  6 <split [45/15]> Resample06
#>  7 <split [45/15]> Resample07
#>  8 <split [45/15]> Resample08
#>  9 <split [45/15]> Resample09
#> 10 <split [45/15]> Resample10
#> # … with 15 more rows
mc_cv(df, strata = label, pool = 0.05)
#> Warning: Stratifying groups that make up 5% of the data may be statistically risky.
#> Consider increasing `pool` to at least 0.1
#> # Monte Carlo cross-validation (0.75/0.25) with 25 resamples  using stratification 
#> # A tibble: 25 x 2
#>    splits          id        
#>    <list>          <chr>     
#>  1 <split [36/24]> Resample01
#>  2 <split [36/24]> Resample02
#>  3 <split [36/24]> Resample03
#>  4 <split [36/24]> Resample04
#>  5 <split [36/24]> Resample05
#>  6 <split [36/24]> Resample06
#>  7 <split [36/24]> Resample07
#>  8 <split [36/24]> Resample08
#>  9 <split [36/24]> Resample09
#> 10 <split [36/24]> Resample10
#> # … with 15 more rows

Created on 2021-03-18 by the reprex package (v1.0.0)
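The two results in the reprex can be checked with a little arithmetic. The sketch below uses base R only and rests on two assumptions: that each label's share of the rows is compared against pool (0.1 is the value the warning recommends), and that per-stratum sample sizes are rounded down, which is inferred from the printed split sizes rather than taken from the rsample source.

```r
# Same layout as the reprex: 12 labels with 5 rows each (60 rows total).
label <- rep(letters[1:12], each = 5)
counts <- as.numeric(table(label))
shares <- counts / length(label)   # each label holds 5/60, about 8.3% of rows

# With a threshold of 0.1 every label is "too small", so nothing is left
# to stratify on; with 0.05 every label clears the threshold.
below_default <- all(shares < 0.10)   # TRUE
kept_at_005 <- all(shares >= 0.05)    # TRUE

# Analysis-set sizes for a 0.75/0.25 split.
prop <- 0.75
unstratified_n <- floor(length(label) * prop)  # 45, matching <split [45/15]>
stratified_n <- sum(floor(counts * prop))      # 12 * 3 = 36, matching <split [36/24]>
```

With only 5 rows per label, stratifying keeps 3 rows per label for analysis, so the effective analysis proportion drops to 0.6 rather than the nominal 0.75. That is one reason the warning calls such small strata risky.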

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.