tidymodels / themis

Extra recipes steps for dealing with unbalanced data
https://themis.tidymodels.org/

How to handle id-variables when upsampling #20

Closed EmilHvitfeldt closed 4 years ago

EmilHvitfeldt commented 4 years ago

Currently, steps in themis use the whole dataset when upsampling and downsampling. This is not much of an issue in step_upsample() and step_downsample(), since they only duplicate or drop existing rows. But it becomes an issue if you specify an ID variable and use steps such as step_smote() and step_rose(), as they produce synthetic observations.

As I see it there are 2 ways to deal with this

Something has to be done since the current implementation can introduce bias by letting the id variable be part of the space.

@juliasilge, @topepo any preferences? The second option seems nice, but I want to make sure it doesn't break anything down the line.

library(recipes)
library(themis)

circle_example2 <- circle_example %>%
  mutate(id = row_number()) %>%
  as_tibble()

circle_example2
#> # A tibble: 400 x 4
#>        x     y class     id
#>    <dbl> <dbl> <fct>  <int>
#>  1  2.59 10.6  Rest       1
#>  2  9.71  6.83 Circle     2
#>  3  9.53 11.6  Rest       3
#>  4  9.73 11.9  Rest       4
#>  5 13.1   9.03 Rest       5
#>  6  9.96  3.64 Rest       6
#>  7  1.13 11.6  Rest       7
#>  8  4.26  2.30 Rest       8
#>  9 10.3   9.71 Circle     9
#> 10  8.20  6.82 Circle    10
#> # … with 390 more rows

recipe(class ~ ., data = circle_example2) %>% 
  update_role(id, new_role = "id") %>%
  step_smote(class) %>%
  prep() %>%
  juice() %>%
  slice(300:310) %>%
  as.data.frame()
#>           x         y       id  class
#> 1  8.479454  9.933081 362.6342 Circle
#> 2  8.715859 10.423173 354.7543 Circle
#> 3  8.427573  8.039237 358.1748 Circle
#> 4  8.388349  9.849708 363.2593 Circle
#> 5  8.068792  7.173167 359.7659 Circle
#> 6  6.287386  8.196584 362.4729 Circle
#> 7  6.188620  9.263268 345.1100 Circle
#> 8  6.688060  8.316327 349.3773 Circle
#> 9  6.885655  7.941686 351.0656 Circle
#> 10 7.255877  8.759476 371.5508 Circle
#> 11 8.543423  9.023393 361.1942 Circle

Created on 2020-05-26 by the reprex package (v0.3.0)

ref: https://github.com/tidymodels/themis/issues/19
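To illustrate the bias mentioned above, here is a minimal base R sketch (not themis code) with a made-up three-row data set. It assumes a plain Euclidean distance and only shows that including the id column changes which row counts as the nearest neighbour of the first row, which in turn would change the synthetic points generated from it.

```r
# Toy data, invented for illustration: rows 1 and 3 are close in (x, y),
# but their ids are far apart.
toy <- data.frame(
  id = c(1, 2, 100),
  x  = c(0, 5, 1),
  y  = c(0, 5, 1)
)

# Hypothetical helper: row index of the nearest neighbour of row 1.
nearest_to_first <- function(m) {
  d <- as.matrix(dist(m))[1, -1]   # distances from row 1 to rows 2..n
  unname(which.min(d)) + 1L        # shift back to the original row index
}

with_id    <- nearest_to_first(toy[, c("id", "x", "y")])
without_id <- nearest_to_first(toy[, c("x", "y")])

with_id     # 2: the id column dominates the distance
without_id  # 3: the neighbour based on the predictors alone
```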

topepo commented 4 years ago

I think that, when making an artificial data point, an NA value would be appropriate. If a row is being completely cloned, then keep the same ID. So, basically keep the ID field (and anything else that is not a predictor or outcome) with the data set.
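A rough base R sketch of that idea, under stated assumptions: make_synthetic_row() is a hypothetical helper written for this comment, not a themis function, and the simple interpolation between two rows stands in for whatever SMOTE or ROSE would actually do. The point is only that the id column is left out of the computation and set to NA for the new row, while original rows keep their ids.

```r
# Hypothetical helper: build one artificial row between rows i and j,
# using only the predictor columns.
make_synthetic_row <- function(data, i, j, predictors, id_col, gap = 0.5) {
  new_row <- data[i, , drop = FALSE]
  for (col in predictors) {
    new_row[[col]] <- data[[col]][i] + gap * (data[[col]][j] - data[[col]][i])
  }
  new_row[[id_col]] <- NA  # an artificial row has no real identifier
  new_row
}

toy <- data.frame(
  id    = 1:3,
  x     = c(1, 3, 10),
  y     = c(2, 4, 10),
  class = factor(c("Circle", "Circle", "Rest"))
)

synthetic <- make_synthetic_row(toy, i = 1, j = 2,
                                predictors = c("x", "y"), id_col = "id")
augmented <- rbind(toy, synthetic)
augmented
```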

juliasilge commented 4 years ago

Most of the real-world use cases I've seen with ID variables have involved joining back to the original data or juicing to get the training data, etc, so I cannot think of examples where creating NA values for synthetic data would be problematic. 🤞
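As a sketch of the join-back case, with invented table and column names: the synthetic rows carry NA ids, so they can be filtered out before joining and only real observations are matched to the original data.

```r
library(dplyr)

# Invented example data: `original` holds extra information per id,
# `resampled` stands in for juiced training data with two synthetic rows.
original  <- tibble(id = c("a", "b", "c"), customer = c("Ann", "Bob", "Cy"))
resampled <- tibble(id = c("a", "b", "c", NA, NA),
                    x  = c(1.0, 2.0, 3.0, 1.5, 2.5))

joined <- resampled %>%
  filter(!is.na(id)) %>%            # drop synthetic rows, which have no id
  left_join(original, by = "id")    # join real rows back to the source data

joined
```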

hnagaty commented 4 years ago

May I also point out that if the id is alphanumeric, prep() fails with the error "Error: All columns selected for the step should be numeric".

library(recipes)
library(themis)
library(stringi)

# create an alphanumeric id
circle_example2 <- circle_example %>%
  mutate(id = stringi::stri_rand_strings(n(), 9, pattern = "[A-Z0-9]")) %>%
  as_tibble()

circle_example2
#> # A tibble: 400 x 4
#>        x     y class  id       
#>    <dbl> <dbl> <fct>  <chr>    
#>  1  2.59 10.6  Rest   V0LEJ5UIE
#>  2  9.71  6.83 Circle QLFQS27KE
#>  3  9.53 11.6  Rest   YWFXJH22L
#>  4  9.73 11.9  Rest   5PUKM8DBN
#>  5 13.1   9.03 Rest   IDAOE0ZZ1
#>  6  9.96  3.64 Rest   IFNI01TZG
#>  7  1.13 11.6  Rest   8LW5XJH9Z
#>  8  4.26  2.30 Rest   C3AFVPI1S
#>  9 10.3   9.71 Circle M30CMUXVV
#> 10  8.20  6.82 Circle 8G5B9GTJ9
#> # … with 390 more rows

recipe(class ~ ., data = circle_example2) %>% 
  update_role(id, new_role = "id") %>%
  step_smote(class) %>%
  prep()
#> Error: All columns selected for the step should be numeric

Created on 2020-08-09 by the reprex package (v0.3.0)

EmilHvitfeldt commented 4 years ago

This should be resolved now!

library(recipes)
library(themis)

circle_example2 <- circle_example %>%
  mutate(id = as.character(row_number())) %>%
  as_tibble()

res <- recipe(class ~ ., data = circle_example2) %>% 
  update_role(id, new_role = "id") %>%
  step_smote(class) %>%
  prep() %>%
  juice()

res
#> # A tibble: 684 x 4
#>        x     y id    class 
#>    <dbl> <dbl> <fct> <fct> 
#>  1  2.59 10.6  1     Rest  
#>  2  9.71  6.83 2     Circle
#>  3  9.53 11.6  3     Rest  
#>  4  9.73 11.9  4     Rest  
#>  5 13.1   9.03 5     Rest  
#>  6  9.96  3.64 6     Rest  
#>  7  1.13 11.6  7     Rest  
#>  8  4.26  2.30 8     Rest  
#>  9 10.3   9.71 9     Circle
#> 10  8.20  6.82 10    Circle
#> # … with 674 more rows

sum(is.na(res$id))
#> [1] 284

Created on 2020-08-11 by the reprex package (v0.3.0)

arno12 commented 4 years ago

Hi everyone,

I am taking the liberty of writing down my personal experience with this, as it might be related. Feel free to discard. I have recently been stuck with the error "Error: All columns selected for the step should be numeric" while using step_rose() inside a recipe containing an id role. I was glad to see that a fix has been applied to fill ids with NAs while balancing the sample with step_rose(). I have downloaded the dev version of themis and restarted my session.

However, this didn't do it for me, as a new, relatively mysterious error was raised when I attempted to tune the prep'd recipe: "recipe: Error: Can't use NA as column index with `[` at position 1." I'm not quite sure how I should proceed to debug this.

I've tried to verify that the ids had been filled, as @EmilHvitfeldt showed above, but I cannot find the id role when juicing my prep'd recipe.

EmilHvitfeldt commented 4 years ago

Can you provide a minimal reproducible example? (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.

I'm not sure that your error comes from {themis}. It may be that you are using the "id" column further down the line in another function that doesn't accept NAs.

library(recipes)
library(themis)

circle_example2 <- circle_example %>%
  mutate(id = as.character(row_number())) %>%
  as_tibble()

res <- recipe(class ~ ., data = circle_example2) %>% 
  update_role(id, new_role = "id") %>%
  step_rose(class) %>%
  prep() %>%
  juice()

res
#> # A tibble: 684 x 4
#>         x     y id    class
#>     <dbl> <dbl> <fct> <fct>
#>  1  8.76  15.4  1     Rest 
#>  2 14.1    8.71 2     Rest 
#>  3 17.0    8.35 3     Rest 
#>  4  2.66   3.31 4     Rest 
#>  5 -1.55   3.03 5     Rest 
#>  6  8.62  11.9  6     Rest 
#>  7  1.65  -1.66 7     Rest 
#>  8  5.43  10.8  8     Rest 
#>  9 -0.201  1.84 9     Rest 
#> 10  3.77   9.37 10    Rest 
#> # … with 674 more rows

Created on 2020-08-19 by the reprex package (v0.3.0)
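One way to check whether the id column and its role are still there after prep() is to look at summary() of the prepped recipe. This is a small sketch using only recipes and a made-up data set; step_center() is just a placeholder step and has nothing to do with the sampling steps.

```r
library(recipes)

# Made-up data with a character id column.
toy <- data.frame(
  id    = as.character(1:6),
  x     = c(1, 2, 3, 4, 5, 6),
  class = factor(rep(c("a", "b"), 3))
)

prepped <- recipe(class ~ ., data = toy) %>%
  update_role(id, new_role = "id") %>%
  step_center(x) %>%
  prep()

# One row per variable, including the role it currently has.
roles <- summary(prepped)
roles[, c("variable", "role")]

# Columns that come back when juicing the prepped recipe.
names(juice(prepped))
```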

arno12 commented 4 years ago

Thanks @EmilHvitfeldt and apologies for not doing a Reprex in the first place. Here is a reproducible example based off the test dataframe you provided, with an additional categorical variable to create dummies.

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(themis)
#> Registered S3 methods overwritten by 'themis':
#>   method               from   
#>   bake.step_downsample recipes
#>   bake.step_upsample   recipes
#>   prep.step_downsample recipes
#>   prep.step_upsample   recipes
#>   tidy.step_downsample recipes
#>   tidy.step_upsample   recipes
#> 
#> Attaching package: 'themis'
#> The following objects are masked from 'package:recipes':
#> 
#>     step_downsample, step_upsample, tunable.step_downsample,
#>     tunable.step_upsample
library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0     ✓ rsample   0.0.7
#> ✓ dials     0.0.8     ✓ tibble    3.0.3
#> ✓ ggplot2   3.3.2     ✓ tidyr     1.1.1
#> ✓ infer     0.5.3     ✓ tune      0.1.1
#> ✓ modeldata 0.0.2     ✓ workflows 0.1.2
#> ✓ parsnip   0.1.2     ✓ yardstick 0.0.7
#> ✓ purrr     0.3.4
#> ── Conflicts ────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()          masks scales::discard()
#> x dplyr::filter()           masks stats::filter()
#> x dplyr::lag()              masks stats::lag()
#> x recipes::step()           masks stats::step()
#> x themis::step_downsample() masks recipes::step_downsample()
#> x themis::step_upsample()   masks recipes::step_upsample()

set.seed(212)

circle_example2 <- themis::circle_example %>%
  mutate(id = row_number(),
         category = factor(case_when(class=='Rest' & y < 9 ~ 'Penguin',
                              class=='Rest' ~ 'Squirrel',
                              TRUE ~ 'Fox'))) %>%
  as_tibble()

processed_recipe <-
  recipe(class ~ ., data = circle_example2) %>%
  update_role(id, new_role = "id") %>%
  step_other(all_nominal(), -all_outcomes(), threshold = 0.05) %>%
  step_dummy(all_predictors(), 
             -all_numeric(), 
             -all_outcomes(), 
             one_hot = TRUE) %>% 
  #step_nzv(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>% 
  step_rose(class, seed = 212) %>% 
  prep() 

spec <- boost_tree(
  trees = 500, 
  tree_depth = tune(), 
  min_n = tune(), 
  loss_reduction = tune(),       ## first three: model complexity
  sample_size = tune(), 
  mtry = tune(),                 ## randomness
  learn_rate = tune()            ## step size  
) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification") %>% 
  translate()

# grid specification
grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), circle_example2),
  learn_rate(),
  size = 30
)

wflow <- 
  workflow() %>% 
  add_recipe(processed_recipe) %>% 
  add_model(spec) 

folds <- vfold_cv(circle_example2, 
                     v = 10, 
                     strata = class, 
                     repeats = 1)

doParallel::registerDoParallel()

res <- tune_grid(
  wflow,
  resamples = folds,
  grid = grid,
  control = control_grid(save_pred = TRUE, 
                         verbose = TRUE)
)
#> Warning: All models failed in tune_grid(). See the `.notes` column.

res$.notes
#> [[1]]
#> # A tibble: 1 x 1
#>   .notes                                                             
#>   <chr>                                                              
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#> 
#> [[2]]
#> # A tibble: 1 x 1
#>   .notes                                                             
#>   <chr>                                                              
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#> 
#> [[3]]
#> # A tibble: 1 x 1
#>   .notes                                                             
#>   <chr>                                                              
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#> 
#> [[4]]
#> # A tibble: 1 x 1
#>   .notes                                                             
#>   <chr>                                                              
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#> 
#> [[5]]
#> # A tibble: 1 x 1
#>   .notes                                                             
#>   <chr>                                                              
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#> 
#> [[6]]
#> # A tibble: 1 x 1
#>   .notes                                                             
#>   <chr>                                                              
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#> 
#> [[7]]
#> # A tibble: 1 x 1
#>   .notes                                                             
#>   <chr>                                                              
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#> 
#> [[8]]
#> # A tibble: 1 x 1
#>   .notes                                                             
#>   <chr>                                                              
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#> 
#> [[9]]
#> # A tibble: 1 x 1
#>   .notes                                                             
#>   <chr>                                                              
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#> 
#> [[10]]
#> # A tibble: 1 x 1
#>   .notes                                                             
#>   <chr>                                                              
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] yardstick_0.0.7   workflows_0.1.2   tune_0.1.1        tidyr_1.1.1      
#>  [5] tibble_3.0.3      rsample_0.0.7     purrr_0.3.4       parsnip_0.1.2    
#>  [9] modeldata_0.0.2   infer_0.5.3       ggplot2_3.3.2     dials_0.0.8      
#> [13] scales_1.1.1      broom_0.7.0       tidymodels_0.1.1  themis_0.1.2.9000
#> [17] recipes_0.1.13    dplyr_1.0.2      
#> 
#> loaded via a namespace (and not attached):
#>  [1] lubridate_1.7.9    doParallel_1.0.15  DiceDesign_1.8-1   tools_4.0.2       
#>  [5] backports_1.1.8    utf8_1.1.4         R6_2.4.1           rpart_4.1-15      
#>  [9] colorspace_1.4-1   nnet_7.3-14        withr_2.2.0        tidyselect_1.1.0  
#> [13] compiler_4.0.2     parallelMap_1.5.0  cli_2.0.2          checkmate_2.0.0   
#> [17] stringr_1.4.0      digest_0.6.25      rmarkdown_2.3      unbalanced_2.0    
#> [21] pkgconfig_2.0.3    htmltools_0.5.0    lhs_1.0.2          highr_0.8         
#> [25] rlang_0.4.7        rstudioapi_0.11    BBmisc_1.11        FNN_1.1.3         
#> [29] generics_0.0.2     magrittr_1.5       ROSE_0.0-3         Matrix_1.2-18     
#> [33] Rcpp_1.0.5         munsell_0.5.0      fansi_0.4.1        GPfit_1.0-8       
#> [37] lifecycle_0.2.0    furrr_0.1.0        stringi_1.4.6      pROC_1.16.2       
#> [41] yaml_2.2.1         MASS_7.3-51.6      plyr_1.8.6         grid_4.0.2        
#> [45] parallel_4.0.2     listenv_0.8.0      crayon_1.3.4       lattice_0.20-41   
#> [49] splines_4.0.2      knitr_1.29         mlr_2.17.1         pillar_1.4.6      
#> [53] xgboost_1.1.1.1    codetools_0.2-16   fastmatch_1.1-0    glue_1.4.1        
#> [57] evaluate_0.14      ParamHelpers_1.14  data.table_1.13.0  vctrs_0.3.2       
#> [61] foreach_1.5.0      gtable_0.3.0       RANN_2.6.1         future_1.18.0     
#> [65] assertthat_0.2.1   xfun_0.15          gower_0.2.2        prodlim_2019.11.13
#> [69] class_7.3-17       survival_3.1-12    timeDate_3043.102  iterators_1.0.12  
#> [73] lava_1.6.7         globals_0.12.5     ellipsis_0.3.1     ipred_0.9-9

Created on 2020-08-20 by the reprex package (v0.3.0)

EmilHvitfeldt commented 4 years ago

If you don't prep() the recipe before adding it to the workflow, then it should work. The workflow preps the recipe itself when it is fitted.

If you have any other problems, please open a new issue :)

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(themis)
#> Registered S3 methods overwritten by 'themis':
#>   method               from   
#>   bake.step_downsample recipes
#>   bake.step_upsample   recipes
#>   prep.step_downsample recipes
#>   prep.step_upsample   recipes
#>   tidy.step_downsample recipes
#>   tidy.step_upsample   recipes
#> 
#> Attaching package: 'themis'
#> The following objects are masked from 'package:recipes':
#> 
#>     step_downsample, step_upsample, tunable.step_downsample,
#>     tunable.step_upsample
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0     ✓ rsample   0.0.7
#> ✓ dials     0.0.8     ✓ tibble    3.0.3
#> ✓ ggplot2   3.3.2     ✓ tidyr     1.1.1
#> ✓ infer     0.5.3     ✓ tune      0.1.1
#> ✓ modeldata 0.0.2     ✓ workflows 0.1.3
#> ✓ parsnip   0.1.3     ✓ yardstick 0.0.7
#> ✓ purrr     0.3.4
#> ── Conflicts ──────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()          masks scales::discard()
#> x dplyr::filter()           masks stats::filter()
#> x dplyr::lag()              masks stats::lag()
#> x recipes::step()           masks stats::step()
#> x themis::step_downsample() masks recipes::step_downsample()
#> x themis::step_upsample()   masks recipes::step_upsample()

set.seed(212)

circle_example2 <- themis::circle_example %>%
  mutate(id = row_number(),
         category = factor(case_when(class=='Rest' & y < 9 ~ 'Penguin',
                              class=='Rest' ~ 'Squirrel',
                              TRUE ~ 'Fox'))) %>%
  as_tibble()

processed_recipe <-
  recipe(class ~ ., data = circle_example2) %>%
  update_role(id, new_role = "id") %>%
  step_other(all_nominal(), -all_outcomes(), threshold = 0.05) %>%
  step_dummy(all_predictors(), 
             -all_numeric(), 
             -all_outcomes(), 
             one_hot = TRUE) %>% 
  #step_nzv(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>% 
  step_rose(class, seed = 212)

spec <- boost_tree(
  trees = 500, 
  tree_depth = tune(), 
  min_n = tune(), 
  loss_reduction = tune(),       ## first three: model complexity
  sample_size = tune(), 
  mtry = tune(),                 ## randomness
  learn_rate = tune()            ## step size  
) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification") %>% 
  translate()

# grid specification
grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), circle_example2),
  learn_rate(),
  size = 30
)

grid
#> # A tibble: 30 x 6
#>    tree_depth min_n loss_reduction sample_size  mtry learn_rate
#>         <int> <int>          <dbl>       <dbl> <int>      <dbl>
#>  1          3    10 0.349                0.335     5   5.23e- 8
#>  2          2     5 0.00000417           0.675     2   2.32e- 3
#>  3         13     3 0.000600             0.772     4   2.04e- 5
#>  4          7    17 0.0514               0.243     5   2.57e- 4
#>  5          9     4 0.000000000251       0.142     1   5.40e- 6
#>  6         11     6 0.00000000635        0.958     2   7.80e- 7
#>  7         13    30 0.0000000391         0.489     3   4.60e- 4
#>  8         15     9 0.0255               0.124     4   1.35e- 2
#>  9          3    28 0.000000714          1.00      3   7.70e-10
#> 10         10    25 5.60                 0.593     3   2.79e- 7
#> # … with 20 more rows

Created on 2020-08-21 by the reprex package (v0.3.0)
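For completeness, a smaller sketch of the same idea without tuning. The logistic regression model is an assumption made only to keep the example short; it is not the xgboost spec from the reprex above. The recipe is passed to the workflow unprepped, and fit() takes care of prepping it.

```r
library(recipes)
library(themis)
library(parsnip)
library(workflows)

circle_example2 <- circle_example
circle_example2$id <- seq_len(nrow(circle_example2))

# Note: no prep() at the end of the recipe.
rec <- recipe(class ~ ., data = circle_example2) %>%
  update_role(id, new_role = "id") %>%
  step_rose(class, seed = 212)

wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# The workflow preps the recipe on the data it is given.
wflow_fit <- fit(wflow, data = circle_example2)
```

The same unprepped recipe can be handed to tune_grid() through the workflow, which then preps it separately within each resample.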

arno12 commented 4 years ago

thank you!

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.