tidymodels / themis

Extra recipes steps for dealing with unbalanced data
https://themis.tidymodels.org/

Clarify RANN::nn2 error when invoked in step_smote #29

Closed dedzo closed 4 years ago

dedzo commented 4 years ago

Running step_smote triggers a (legitimate) error from the RANN package when a class in the data has fewer observations than the neighbors parameter requires. This can happen with small data sets, or with moderately sized ones during cross-validation, where each resample contains fewer rows per class.

It would be helpful to catch this case with a check inside themis so that the error is easier to debug: the RANN message refers to different variable names and requires the user to know more about how SMOTE works. For example:

Error in themis::step_smote: neighbors must be below the smallest class size
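
A minimal sketch of what such a pre-check could look like, in base R only. The helper name `check_smote_neighbors()` and the message wording are hypothetical and are not taken from themis. It assumes the neighbour search runs within each class and asks for `neighbors + 1` points, as the `k = k + 1` in the RANN call below suggests.

```r
# Hypothetical helper: not part of themis; name and wording are illustrative.
check_smote_neighbors <- function(classes, neighbors = 5) {
  # Count observations per class, ignoring empty factor levels.
  counts <- table(classes)
  counts <- counts[counts > 0]

  # Assumption: each class needs at least neighbors + 1 rows,
  # because the nearest-neighbour search is called with k = neighbors + 1.
  too_small <- names(counts)[counts <= neighbors]

  if (length(too_small) > 0) {
    stop(
      "`neighbors` (", neighbors, ") must be below the smallest class size; ",
      "too few observations in: ", paste(too_small, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Toy outcome mirroring the reprex: one class with a single observation.
classes <- factor(c("dummy", rep("other", 20), rep("stem", 10)))

# Capture the error message instead of aborting the script.
result <- tryCatch(
  check_smote_neighbors(classes, neighbors = 5),
  error = function(e) conditionMessage(e)
)
print(result)
```

Naming the offending class in the message would spare the user from having to map RANN's `k` and "points" back to `neighbors` and class counts.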

The reprex below demonstrates the current behaviour:

library(tidyverse)
library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0      ✓ recipes   0.1.13
#> ✓ dials     0.0.8      ✓ rsample   0.0.7 
#> ✓ infer     0.5.3      ✓ tune      0.1.1 
#> ✓ modeldata 0.0.2      ✓ workflows 0.1.2 
#> ✓ parsnip   0.1.2      ✓ yardstick 0.0.7
#> ── Conflicts ────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter()   masks stats::filter()
#> x recipes::fixed()  masks stringr::fixed()
#> x dplyr::lag()      masks stats::lag()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step()   masks stats::step()
library(themis)
#> 
#> Attaching package: 'themis'
#> The following objects are masked from 'package:recipes':
#> 
#>     step_downsample, step_upsample, tunable.step_downsample,
#>     tunable.step_upsample

data("okc")

new_data<-okc

new_data$Class<-as.character(new_data$Class)

new_data$Class[1]<-"dummy"

new_data$Class<-as.factor(new_data$Class)

c<-recipe(Class ~ ., data =new_data) %>%
  update_role(date, new_role = 'date')%>%
  update_role(diet, new_role = 'diet')%>%
  update_role(location, new_role='location')%>%
  step_unknown(diet, new_level = 'unknown')%>%
  step_meanimpute(all_predictors()) %>%
  step_smote(Class) %>%
  prep()%>%
  juice()%>%
  view()
#> Error in RANN::nn2(data, k = k + 1, searchtype = "priority"): Cannot find more nearest neighbours than there are points

count(new_data, Class)
#> # A tibble: 3 x 2
#>   Class     n
#>   <fct> <int>
#> 1 dummy     1
#> 2 other 50315
#> 3 stem   9539

EmilHvitfeldt commented 4 years ago

This should be fixed now:

library(tidyverse)
library(tidymodels)
#> ── Attaching packages ───────────────────────────────── tidymodels 0.1.1.9000 ──
#> ✓ broom     0.7.0          ✓ recipes   0.1.15    
#> ✓ dials     0.0.9.9000     ✓ rsample   0.0.8     
#> ✓ infer     0.5.3          ✓ tune      0.1.1.9000
#> ✓ modeldata 0.1.0          ✓ workflows 0.1.3     
#> ✓ parsnip   0.1.4          ✓ yardstick 0.0.7
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter()   masks stats::filter()
#> x recipes::fixed()  masks stringr::fixed()
#> x dplyr::lag()      masks stats::lag()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step()   masks stats::step()
library(themis)
#> Registered S3 methods overwritten by 'themis':
#>   method                  from   
#>   bake.step_downsample    recipes
#>   bake.step_upsample      recipes
#>   prep.step_downsample    recipes
#>   prep.step_upsample      recipes
#>   tidy.step_downsample    recipes
#>   tidy.step_upsample      recipes
#>   tunable.step_downsample recipes
#>   tunable.step_upsample   recipes
#> 
#> Attaching package: 'themis'
#> The following objects are masked from 'package:tune':
#> 
#>     required_pkgs, tunable
#> The following objects are masked from 'package:recipes':
#> 
#>     step_downsample, step_upsample

data("okc")

new_data<-okc

new_data$Class<-as.character(new_data$Class)

new_data$Class[1]<-"dummy"

new_data$Class<-as.factor(new_data$Class)

c<-recipe(Class ~ ., data =new_data) %>%
  update_role(date, new_role = 'date')%>%
  update_role(diet, new_role = 'diet')%>%
  update_role(location, new_role='location')%>%
  step_unknown(diet, new_level = 'unknown')%>%
  step_meanimpute(all_predictors()) %>%
  step_smote(Class) %>%
  prep()%>%
  juice()%>%
  view()
#> Error: Not enough observations of 'dummy' to perform SMOTE.

count(new_data, Class)
#> # A tibble: 3 x 2
#>   Class     n
#>   <fct> <int>
#> 1 dummy     1
#> 2 other 50315
#> 3 stem   9539

Created on 2020-11-11 by the reprex package (v0.3.0)
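
Once the error names the offending class, one possible way to proceed is to drop classes that are too small before applying SMOTE. The sketch below uses base R only; `drop_rare_classes()` is a hypothetical helper, not a themis function, and the threshold assumes each class needs more than `neighbors` rows.

```r
# Hypothetical helper: not part of themis; shown only as one possible workaround.
drop_rare_classes <- function(data, outcome, neighbors = 5) {
  # Count rows per class in the outcome column.
  counts <- table(data[[outcome]])

  # Assumption: a class needs more than `neighbors` rows to be usable.
  keep <- names(counts)[counts > neighbors]

  # Keep only rows from sufficiently large classes and drop unused levels.
  out <- data[data[[outcome]] %in% keep, , drop = FALSE]
  out[[outcome]] <- droplevels(factor(out[[outcome]]))
  out
}

# Toy data mirroring the reprex: "dummy" has a single row.
toy <- data.frame(
  x = seq_len(31),
  Class = c("dummy", rep("other", 20), rep("stem", 10))
)

cleaned <- drop_rare_classes(toy, "Class", neighbors = 5)
print(table(cleaned$Class))
```

Whether dropping a class is acceptable depends on the modelling problem; collecting more data for the rare class or merging it into another level are alternatives.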
