Closed EmilHvitfeldt closed 4 years ago
I think that, when making an artificial data point, an NA value would be appropriate. If a row is being completely cloned, then keep the same ID. So, basically keep the ID field (and anything else that is not a predictor or outcome) with the data set.
Most of the real-world use cases I've seen with ID variables have involved joining back to the original data or juicing to get the training data, etc., so I cannot think of examples where creating NA values for synthetic data would be problematic. 🤞
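To illustrate the join-back use case mentioned above, here is a minimal base R sketch. The data frames `original` and `balanced` and their columns are hypothetical stand-ins, not output from themis; the assumption is only that synthetic rows carry an NA id while real rows keep theirs.

```r
# Hypothetical original data, keyed by an id column.
original <- data.frame(
  id = c("a", "b", "c"),
  site = c("north", "south", "east"),
  stringsAsFactors = FALSE
)

# Hypothetical result of a sampling step: the last two rows are
# synthetic, so their id is NA.
balanced <- data.frame(
  id = c("a", "b", "c", NA, NA),
  x = c(1.0, 2.0, 3.0, 1.5, 2.5),
  stringsAsFactors = FALSE
)

# Synthetic rows are exactly the ones without an id.
is_synthetic <- is.na(balanced$id)

# Keep only the real rows before joining back to the original data.
real_rows <- balanced[!is_synthetic, ]
joined <- merge(real_rows, original, by = "id")
joined
```

Filtering on `is.na(id)` before the join keeps the synthetic rows from taking part in the key matching at all.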
May I also point out that if the id is alphanumeric, prep() fails with the error "Error: All columns selected for the step should be numeric".
library(recipes)
library(themis)
library(stringi)
# create an alphanumeric id
circle_example2 <- circle_example %>%
  mutate(id = stringi::stri_rand_strings(n(), 9, pattern = "[A-Z0-9]")) %>%
  as_tibble()
circle_example2
#> # A tibble: 400 x 4
#> x y class id
#> <dbl> <dbl> <fct> <chr>
#> 1 2.59 10.6 Rest V0LEJ5UIE
#> 2 9.71 6.83 Circle QLFQS27KE
#> 3 9.53 11.6 Rest YWFXJH22L
#> 4 9.73 11.9 Rest 5PUKM8DBN
#> 5 13.1 9.03 Rest IDAOE0ZZ1
#> 6 9.96 3.64 Rest IFNI01TZG
#> 7 1.13 11.6 Rest 8LW5XJH9Z
#> 8 4.26 2.30 Rest C3AFVPI1S
#> 9 10.3 9.71 Circle M30CMUXVV
#> 10 8.20 6.82 Circle 8G5B9GTJ9
#> # … with 390 more rows
recipe(class ~ ., data = circle_example2) %>%
  update_role(id, new_role = "id") %>%
  step_smote(class) %>%
  prep()
#> Error: All columns selected for the step should be numeric
Created on 2020-08-09 by the reprex package (v0.3.0)
This should be resolved now!
library(recipes)
library(themis)
circle_example2 <- circle_example %>%
  mutate(id = as.character(row_number())) %>%
  as_tibble()
res <- recipe(class ~ ., data = circle_example2) %>%
  update_role(id, new_role = "id") %>%
  step_smote(class) %>%
  prep() %>%
  juice()
res
#> # A tibble: 684 x 4
#> x y id class
#> <dbl> <dbl> <fct> <fct>
#> 1 2.59 10.6 1 Rest
#> 2 9.71 6.83 2 Circle
#> 3 9.53 11.6 3 Rest
#> 4 9.73 11.9 4 Rest
#> 5 13.1 9.03 5 Rest
#> 6 9.96 3.64 6 Rest
#> 7 1.13 11.6 7 Rest
#> 8 4.26 2.30 8 Rest
#> 9 10.3 9.71 9 Circle
#> 10 8.20 6.82 10 Circle
#> # … with 674 more rows
sum(is.na(res$id))
#> [1] 284
Created on 2020-08-11 by the reprex package (v0.3.0)
Hi everyone,
I am taking the liberty to write down my personal experience with this as it might be related. Feel free to discard.
I have recently been stuck with the error "Error: All columns selected for the step should be numeric" while using step_rose inside a recipe containing an id role. I was glad to see that a fix has been applied to fill ids with NAs while balancing the sample with step_rose. I have downloaded the dev version of themis and restarted my session.
However, this didn't do it for me as a new, relatively mysterious error was raised when I attempted to tune the prep'd recipe:
recipe: Error: Can't use NA as column index with [ at position 1.
I'm not quite sure how I should proceed to debug this.
I've tried to verify that the ids had been filled, just as @EmilHvitfeldt showed above, but I cannot find the id role when juicing my prep'd recipe.
Can you provide a minimal reproducible example? (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.
I'm not sure that your error comes from {themis}; it may be that you are using the "id" column further down the line in another function that doesn't accept NAs.
library(recipes)
library(themis)
circle_example2 <- circle_example %>%
  mutate(id = as.character(row_number())) %>%
  as_tibble()
res <- recipe(class ~ ., data = circle_example2) %>%
  update_role(id, new_role = "id") %>%
  step_rose(class) %>%
  prep() %>%
  juice()
res
#> # A tibble: 684 x 4
#> x y id class
#> <dbl> <dbl> <fct> <fct>
#> 1 8.76 15.4 1 Rest
#> 2 14.1 8.71 2 Rest
#> 3 17.0 8.35 3 Rest
#> 4 2.66 3.31 4 Rest
#> 5 -1.55 3.03 5 Rest
#> 6 8.62 11.9 6 Rest
#> 7 1.65 -1.66 7 Rest
#> 8 5.43 10.8 8 Rest
#> 9 -0.201 1.84 9 Rest
#> 10 3.77 9.37 10 Rest
#> # … with 674 more rows
Created on 2020-08-19 by the reprex package (v0.3.0)
Thanks @EmilHvitfeldt and apologies for not doing a Reprex in the first place. Here is a reproducible example based off the test dataframe you provided, with an additional categorical variable to create dummies.
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(themis)
#> Registered S3 methods overwritten by 'themis':
#> method from
#> bake.step_downsample recipes
#> bake.step_upsample recipes
#> prep.step_downsample recipes
#> prep.step_upsample recipes
#> tidy.step_downsample recipes
#> tidy.step_upsample recipes
#>
#> Attaching package: 'themis'
#> The following objects are masked from 'package:recipes':
#>
#> step_downsample, step_upsample, tunable.step_downsample,
#> tunable.step_upsample
library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom 0.7.0 ✓ rsample 0.0.7
#> ✓ dials 0.0.8 ✓ tibble 3.0.3
#> ✓ ggplot2 3.3.2 ✓ tidyr 1.1.1
#> ✓ infer 0.5.3 ✓ tune 0.1.1
#> ✓ modeldata 0.0.2 ✓ workflows 0.1.2
#> ✓ parsnip 0.1.2 ✓ yardstick 0.0.7
#> ✓ purrr 0.3.4
#> ── Conflicts ────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
#> x themis::step_downsample() masks recipes::step_downsample()
#> x themis::step_upsample() masks recipes::step_upsample()
set.seed(212)
circle_example2 <- themis::circle_example %>%
  mutate(id = row_number(),
         category = factor(case_when(class=='Rest' & y < 9 ~ 'Penguin',
                                     class=='Rest' ~ 'Squirrel',
                                     TRUE ~ 'Fox'))) %>%
  as_tibble()
processed_recipe <-
  recipe(class ~ ., data = circle_example2) %>%
  update_role(id, new_role = "id") %>%
  step_other(all_nominal(), -all_outcomes(), threshold = 0.05) %>%
  step_dummy(all_predictors(),
             -all_numeric(),
             -all_outcomes(),
             one_hot = TRUE) %>%
  #step_nzv(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_rose(class, seed = 212) %>%
  prep()
spec <- boost_tree(
  trees = 500,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(), ## first three: model complexity
  sample_size = tune(),
  mtry = tune(), ## randomness
  learn_rate = tune() ## step size
) %>%
  set_engine("xgboost") %>%
  set_mode("classification") %>%
  translate()
# grid specification
grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), circle_example2),
  learn_rate(),
  size = 30
)
wflow <-
  workflow() %>%
  add_recipe(processed_recipe) %>%
  add_model(spec)
folds <- vfold_cv(circle_example2,
                  v = 10,
                  strata = class,
                  repeats = 1)
doParallel::registerDoParallel()
res <- tune_grid(
  wflow,
  resamples = folds,
  grid = grid,
  control = control_grid(save_pred = TRUE,
                         verbose = TRUE)
)
#> Warning: All models failed in tune_grid(). See the `.notes` column.
res$.notes
#> [[1]]
#> # A tibble: 1 x 1
#> .notes
#> <chr>
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#>
#> [[2]]
#> # A tibble: 1 x 1
#> .notes
#> <chr>
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#>
#> [[3]]
#> # A tibble: 1 x 1
#> .notes
#> <chr>
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#>
#> [[4]]
#> # A tibble: 1 x 1
#> .notes
#> <chr>
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#>
#> [[5]]
#> # A tibble: 1 x 1
#> .notes
#> <chr>
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#>
#> [[6]]
#> # A tibble: 1 x 1
#> .notes
#> <chr>
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#>
#> [[7]]
#> # A tibble: 1 x 1
#> .notes
#> <chr>
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#>
#> [[8]]
#> # A tibble: 1 x 1
#> .notes
#> <chr>
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#>
#> [[9]]
#> # A tibble: 1 x 1
#> .notes
#> <chr>
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
#>
#> [[10]]
#> # A tibble: 1 x 1
#> .notes
#> <chr>
#> 1 recipe: Error: Can't use NA as column index with `[` at position 3.
sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] yardstick_0.0.7 workflows_0.1.2 tune_0.1.1 tidyr_1.1.1
#> [5] tibble_3.0.3 rsample_0.0.7 purrr_0.3.4 parsnip_0.1.2
#> [9] modeldata_0.0.2 infer_0.5.3 ggplot2_3.3.2 dials_0.0.8
#> [13] scales_1.1.1 broom_0.7.0 tidymodels_0.1.1 themis_0.1.2.9000
#> [17] recipes_0.1.13 dplyr_1.0.2
#>
#> loaded via a namespace (and not attached):
#> [1] lubridate_1.7.9 doParallel_1.0.15 DiceDesign_1.8-1 tools_4.0.2
#> [5] backports_1.1.8 utf8_1.1.4 R6_2.4.1 rpart_4.1-15
#> [9] colorspace_1.4-1 nnet_7.3-14 withr_2.2.0 tidyselect_1.1.0
#> [13] compiler_4.0.2 parallelMap_1.5.0 cli_2.0.2 checkmate_2.0.0
#> [17] stringr_1.4.0 digest_0.6.25 rmarkdown_2.3 unbalanced_2.0
#> [21] pkgconfig_2.0.3 htmltools_0.5.0 lhs_1.0.2 highr_0.8
#> [25] rlang_0.4.7 rstudioapi_0.11 BBmisc_1.11 FNN_1.1.3
#> [29] generics_0.0.2 magrittr_1.5 ROSE_0.0-3 Matrix_1.2-18
#> [33] Rcpp_1.0.5 munsell_0.5.0 fansi_0.4.1 GPfit_1.0-8
#> [37] lifecycle_0.2.0 furrr_0.1.0 stringi_1.4.6 pROC_1.16.2
#> [41] yaml_2.2.1 MASS_7.3-51.6 plyr_1.8.6 grid_4.0.2
#> [45] parallel_4.0.2 listenv_0.8.0 crayon_1.3.4 lattice_0.20-41
#> [49] splines_4.0.2 knitr_1.29 mlr_2.17.1 pillar_1.4.6
#> [53] xgboost_1.1.1.1 codetools_0.2-16 fastmatch_1.1-0 glue_1.4.1
#> [57] evaluate_0.14 ParamHelpers_1.14 data.table_1.13.0 vctrs_0.3.2
#> [61] foreach_1.5.0 gtable_0.3.0 RANN_2.6.1 future_1.18.0
#> [65] assertthat_0.2.1 xfun_0.15 gower_0.2.2 prodlim_2019.11.13
#> [69] class_7.3-17 survival_3.1-12 timeDate_3043.102 iterators_1.0.12
#> [73] lava_1.6.7 globals_0.12.5 ellipsis_0.3.1 ipred_0.9-9
Created on 2020-08-20 by the reprex package (v0.3.0)
If you don't prep the recipe before adding it to the workflow, then it should work.
If you have any other problems, please open a new issue :)
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(themis)
#> Registered S3 methods overwritten by 'themis':
#> method from
#> bake.step_downsample recipes
#> bake.step_upsample recipes
#> prep.step_downsample recipes
#> prep.step_upsample recipes
#> tidy.step_downsample recipes
#> tidy.step_upsample recipes
#>
#> Attaching package: 'themis'
#> The following objects are masked from 'package:recipes':
#>
#> step_downsample, step_upsample, tunable.step_downsample,
#> tunable.step_upsample
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom 0.7.0 ✓ rsample 0.0.7
#> ✓ dials 0.0.8 ✓ tibble 3.0.3
#> ✓ ggplot2 3.3.2 ✓ tidyr 1.1.1
#> ✓ infer 0.5.3 ✓ tune 0.1.1
#> ✓ modeldata 0.0.2 ✓ workflows 0.1.3
#> ✓ parsnip 0.1.3 ✓ yardstick 0.0.7
#> ✓ purrr 0.3.4
#> ── Conflicts ──────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
#> x themis::step_downsample() masks recipes::step_downsample()
#> x themis::step_upsample() masks recipes::step_upsample()
set.seed(212)
circle_example2 <- themis::circle_example %>%
  mutate(id = row_number(),
         category = factor(case_when(class=='Rest' & y < 9 ~ 'Penguin',
                                     class=='Rest' ~ 'Squirrel',
                                     TRUE ~ 'Fox'))) %>%
  as_tibble()
processed_recipe <-
  recipe(class ~ ., data = circle_example2) %>%
  update_role(id, new_role = "id") %>%
  step_other(all_nominal(), -all_outcomes(), threshold = 0.05) %>%
  step_dummy(all_predictors(),
             -all_numeric(),
             -all_outcomes(),
             one_hot = TRUE) %>%
  #step_nzv(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_rose(class, seed = 212)
spec <- boost_tree(
  trees = 500,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(), ## first three: model complexity
  sample_size = tune(),
  mtry = tune(), ## randomness
  learn_rate = tune() ## step size
) %>%
  set_engine("xgboost") %>%
  set_mode("classification") %>%
  translate()
# grid specification
grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), circle_example2),
  learn_rate(),
  size = 30
)
grid
#> # A tibble: 30 x 6
#> tree_depth min_n loss_reduction sample_size mtry learn_rate
#> <int> <int> <dbl> <dbl> <int> <dbl>
#> 1 3 10 0.349 0.335 5 5.23e- 8
#> 2 2 5 0.00000417 0.675 2 2.32e- 3
#> 3 13 3 0.000600 0.772 4 2.04e- 5
#> 4 7 17 0.0514 0.243 5 2.57e- 4
#> 5 9 4 0.000000000251 0.142 1 5.40e- 6
#> 6 11 6 0.00000000635 0.958 2 7.80e- 7
#> 7 13 30 0.0000000391 0.489 3 4.60e- 4
#> 8 15 9 0.0255 0.124 4 1.35e- 2
#> 9 3 28 0.000000714 1.00 3 7.70e-10
#> 10 10 25 5.60 0.593 3 2.79e- 7
#> # … with 20 more rows
Created on 2020-08-21 by the reprex package (v0.3.0)
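The reprex above stops at the grid. As a rough sketch of the remaining step, the unprepped recipe can be handed straight to a workflow, which then preps it inside each resample. To keep the example small, this sketch swaps the tuned xgboost model for a plain logistic regression with fit_resamples(); that substitution, the 5 folds, and the object names `rec` and `wflow` are assumptions made for illustration, not part of the original reply.

```r
library(tidymodels)
library(themis)

set.seed(212)
circle_example2 <- themis::circle_example %>%
  mutate(id = row_number()) %>%
  as_tibble()

# The recipe is left unprepped: there is no prep() at the end of the pipe.
rec <- recipe(class ~ ., data = circle_example2) %>%
  update_role(id, new_role = "id") %>%
  step_rose(class, seed = 212)

# Assumption: a simple glm model stands in for the tuned boost_tree() spec.
wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

folds <- vfold_cv(circle_example2, v = 5, strata = class)

# The workflow preps the recipe on each analysis set internally.
res <- fit_resamples(wflow, resamples = folds)
collect_metrics(res)
```

The same pattern should apply with tune_grid() and the `spec` and `grid` objects from the reprex, as long as `processed_recipe` is not prepped beforehand.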
thank you!
Currently, steps in themis use the whole dataset when upsampling and downsampling. This is not much of an issue in step_upsample() and step_downsample(). But it becomes an issue if you specify an ID variable and use steps such as step_smote() and step_rose(), as they produce synthetic observations.
As I see it there are 2 ways to deal with this
Something has to be done since the current implementation can introduce bias by letting the id variable be part of the space.
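A small base R sketch of why this matters, using made-up points rather than anything from themis itself: when a numeric id column is included in the distance calculation, it can change which row counts as the nearest neighbour, and neighbour-based steps build synthetic points from those neighbours. The data frame `pts` and the helper `nearest_to_first()` are hypothetical and exist only for this illustration.

```r
# Three made-up points; the id values are arbitrary labels, not features.
pts <- data.frame(
  x = c(0, 1, 3),
  y = c(0, 0, 0),
  id = c(1, 100, 2)
)

# Hypothetical helper: row number of the nearest neighbour of row 1,
# using Euclidean distance over whatever columns are passed in.
nearest_to_first <- function(df) {
  d <- as.matrix(dist(df))
  others <- d[1, -1]
  as.integer(names(which.min(others)))
}

# Using only the predictors, row 2 is closest to row 1.
nearest_to_first(pts[, c("x", "y")])

# Letting the id column into the space makes row 3 look closest instead.
nearest_to_first(pts)
```

Restricting the calculation to the predictor columns removes the influence of the id values on the neighbour search.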
@juliasilge, @topepo any preferences? The second option seems nice, but I want to make sure it doesn't break anything down the line.
Created on 2020-05-26 by the reprex package (v0.3.0)
ref: https://github.com/tidymodels/themis/issues/19