tidymodels / rsample

Classes and functions to create and summarize resampling objects
https://rsample.tidymodels.org

Allow for separate assessment dataset #255

Closed matthewrspiegel closed 2 years ago

matthewrspiegel commented 3 years ago

I’m not sure if this is already possible, so please forgive me if it is. Also, please let me know if this would be a better fit for tune or elsewhere.

I have a situation where my dataset is too large for my server when using parallel processing (copying the data to each worker instantly blows up the RAM). To get around this, I want to undersample the majority class before passing the data to the workers for tuning. However, this causes issues when using vfold_cv since the assessment dataset is one of the (now undersampled) folds.

Is there a way (and if not, should there be) to pass vfold_cv a separate dataset for assessment purposes that preserves the class imbalance?

Note: I purposefully don’t want to shrink the entire dataset prior to training since I would be losing a lot of information on the minority class that is already fairly small.
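For concreteness, here’s a rough sketch of what I’m doing now (df and the outcome column class are just stand-ins for my real data):

library(dplyr)
library(recipes)
library(themis)
library(rsample)

## undersample the majority class up front so less data gets copied to the workers
df_small <- recipe(class ~ ., data = df) %>%
  step_downsample(class) %>%
  prep() %>%
  bake(new_data = NULL)

folds <- vfold_cv(df_small, v = 10, strata = class)
## every assessment set is now drawn from the balanced data,
## not the original class distribution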

I also wanted to mention that I trained/tuned using the downsampled dataset (for both training and assessment) and it didn’t seem to matter when validating performance on an unseen dataset “in the wild.” So maybe the answer is that it doesn’t matter, but that seems wrong to me and could just be a fluke.

Thanks!

Edit: Just noticed this issue/solution #355 in tune which may work for me as well.

juliasilge commented 3 years ago

Parallel processing does involve a memory/speed tradeoff, and in tidymodels/tune#397 we think we did what we could so that each worker only gets the data it needs (not the full set of resamples, but only the individual resample that that worker will use). In your case, the analysis set would then be downsampled and the assessment set would stay as is. Sounds like what you're saying is that the analysis set before downsampling is so big that this is still a problem for you.

The best solution might be to scale back the number of workers so you don't blow up the RAM usage so much.

I don't think it will make sense for a function like vfold_cv() to take a specified assessment set because it's not really cross-validation then, but you may want to use our tools for manually creating resampling objects, such as make_splits() and then manual_rset(). What this would let you do is manually downsample your data (maybe using prep()/bake() or just slice_sample()) and then create an rset that can be used for tuning:

library(rsample)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data(cells, package = "modeldata")

## this is like a single validation split: 
my_split <- make_splits(
  cells %>% filter(case == "Train") %>% slice_sample(n = 1000),
  cells %>% filter(case == "Test")
)
my_split
#> <Analysis/Assess/Total>
#> <1000/1010/2010>
manual_rset(list(my_split), ids = c("Split 1"))
#> # Manual resampling 
#> # A tibble: 1 × 2
#>   splits              id     
#>   <list>              <chr>  
#> 1 <split [1000/1010]> Split 1

## this is like sort of kind of not really like CV
my_split <- function() make_splits(
  ## no seed so randomly sampling each time; maybe use a better strategy for reproducibility
  cells %>% filter(case == "Train") %>% slice_sample(n = 1000),
  cells %>% filter(case == "Test")
)

purrr::map(1:5, ~my_split()) %>%
  manual_rset(ids = paste("Split", 1:5))
#> # Manual resampling 
#> # A tibble: 5 × 2
#>   splits              id     
#>   <list>              <chr>  
#> 1 <split [1000/1010]> Split 1
#> 2 <split [1000/1010]> Split 2
#> 3 <split [1000/1010]> Split 3
#> 4 <split [1000/1010]> Split 4
#> 5 <split [1000/1010]> Split 5

Created on 2021-09-17 by the reprex package (v2.0.1)

This would let you pass the workers single resamples that are already downsampled. The way I showed doing it here requires the development version of rsample.
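For example, that manual rset can be handed to fit_resamples() or tune_grid() like any other resampling object; the model spec and predictors below are just placeholders to show it being consumed:

library(parsnip)
library(tune)

my_rset <- purrr::map(1:5, ~my_split()) %>%
  manual_rset(ids = paste("Split", 1:5))

## a simple spec, just to show the manual rset working with tune
glm_spec <- logistic_reg() %>% set_engine("glm")
fit_resamples(glm_spec, class ~ area_ch_1 + avg_inten_ch_1, resamples = my_rset)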

matthewrspiegel commented 3 years ago

Thank you for the detailed response!! Is the fix for parallel tuning you mention above in the CRAN release of tune? When I check top while tuning, it looks like all of the folds are being passed to each worker. I can create a reprex if necessary, but I want to confirm that the fix isn’t only in the development version. If it’s in the CRAN release, then maybe I just need to update.

Agreed that it doesn’t fit in vfold_cv since it wouldn’t really be CV anymore. I guess the easiest way to do this would be to have a step_downsample-like function that is only applied to the assessment dataset, to downsample the minority class back to the “real” distribution, but that seems ultra specific and I wouldn’t know how to implement it.

Anyway, thanks again and I’ll give your method a go!

juliasilge commented 3 years ago

Looks like that parallel processing fix for tune has not yet gone to CRAN, so you'll need to install from GitHub.
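For example, something along these lines (using the remotes package):

## install.packages("remotes")
remotes::install_github("tidymodels/tune")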

matthewrspiegel commented 3 years ago

Ah okay. I’m unfortunately unable to install from GitHub behind my job’s firewall (which I’m now realizing also makes the manual_rset solution above not possible for now).

Do you have any idea when CRAN tune and/or rsample will be updated?

juliasilge commented 3 years ago

We're currently working on a recipes release, so these will likely come a bit after that, probably within the next month or so. Until then, I would scale way back on the number of workers (to, like, 2) and plan for longer training times. When everything stays in the same session, the rsample approach to resamples is very memory efficient.
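For instance, a minimal sketch of registering just two workers with doParallel before tuning (the exact parallel backend you use may differ):

library(doParallel)

## register only a couple of workers to trade speed for memory
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

## ... run tune_grid() / fit_resamples() as usual ...

stopCluster(cl)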

PathosEthosLogos commented 3 years ago

> What this would let you do is manually downsample your data (maybe using prep()/bake()

I just wanted to make sure I understand this correctly.

Given that df_test is a small portion of the dataset held out for testing, and that spec_model is some model specification (linear_reg(), for example), in this scenario:

dfp = df |>
  recipe(y ~ .) |>
  step_dummy(some_variable) |>
  prep()

spec_model |> fit(y ~ ., data = dfp |> bake(new_data = df_test))

is this what you mean?

juliasilge commented 3 years ago

No, more like this:

library(tidyverse)
library(rsample)
data(concrete, package = "modeldata")

concrete_split <- initial_split(concrete, prop = 0.8)

my_split <- function() make_splits(
  training(concrete_split) %>% slice_sample(n = 500),
  testing(concrete_split)
)

purrr::map(1:5, ~my_split()) %>%
  manual_rset(ids = paste("Split", 1:5))
#> # Manual resampling 
#> # A tibble: 5 × 2
#>   splits            id     
#>   <list>            <chr>  
#> 1 <split [500/206]> Split 1
#> 2 <split [500/206]> Split 2
#> 3 <split [500/206]> Split 3
#> 4 <split [500/206]> Split 4
#> 5 <split [500/206]> Split 5

Created on 2021-09-28 by the reprex package (v2.0.1)

If you have, say, way more training data than you want to use for tuning your model, you can manually make an rset object using the whole testing data and random subsets of the training data in each fold. This wouldn't be as good as regular cross-validation, but we do allow this kind of flexibility in manual_rset() for folks who have specific custom needs. That object I just created can be used for tuning or resampling a model. I did it here with slice_sample(), but you could create a little tiny recipe and use step_downsample() if you prefer.

PathosEthosLogos commented 3 years ago

Then where would prep()/bake() be applied? If I'm understanding this correctly, it seems like that code would take care of the train-test split without using prep() and bake().

juliasilge commented 3 years ago

You would do it something like this, if you wanted to use step_downsample() to make sure the classes were balanced in the training set for tuning:

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(themis)
#> Registered S3 methods overwritten by 'themis':
#>   method                  from   
#>   bake.step_downsample    recipes
#>   bake.step_upsample      recipes
#>   prep.step_downsample    recipes
#>   prep.step_upsample      recipes
#>   tidy.step_downsample    recipes
#>   tidy.step_upsample      recipes
#>   tunable.step_downsample recipes
#>   tunable.step_upsample   recipes
#> 
#> Attaching package: 'themis'
#> The following objects are masked from 'package:recipes':
#> 
#>     step_downsample, step_upsample
data(attrition, package = "modeldata")

attrition_split <- initial_split(attrition, prop = 0.8, strata = Attrition)

my_split <- function(split, outcome, seed) {
  baked_training <- 
    recipe(training(split)) %>%
    step_downsample({{outcome}}, seed = seed) %>%
    prep() %>%
    bake(new_data = NULL)

  make_splits(
    baked_training,
    testing(split)
  )
}

purrr::map(1:5, ~my_split(attrition_split, Attrition, seed = .)) %>%
  manual_rset(ids = paste("Split", 1:5))
#> # Manual resampling 
#> # A tibble: 5 × 2
#>   splits            id     
#>   <list>            <chr>  
#> 1 <split [378/295]> Split 1
#> 2 <split [378/295]> Split 2
#> 3 <split [378/295]> Split 3
#> 4 <split [378/295]> Split 4
#> 5 <split [378/295]> Split 5

Created on 2021-09-28 by the reprex package (v2.0.1)

juliasilge commented 2 years ago

We are very close to a tune release (keep your eyes out for that!) and I am spring cleaning repos, so I am going to close this issue. Let us know if you have further questions about rsample or tune! 🙌

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.