tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

`has_role()` does not select columns for imputation in `step_impute_knn()` #1197

Open andreranza opened 1 year ago

andreranza commented 1 year ago

The problem

I'm having trouble selecting columns to impute within step_impute_knn() using has_role(). Thanks!

Reproducible example

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)

# imputation works
recipe(GDP ~ ., data = df) |>
  step_impute_knn(
    c("B", "C"), 
    neighbors = 2, 
    impute_with = c("D", "A")
  ) |> 
  prep() |> 
  juice()
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3  13      6931.
#> 2 BGD           165516222  1136     5   8.5   35264.
#> 3 BRA           211782878  2463     5   5   8159001.
#> 4 CHN          1407745000  2951     6   4.5 8485748 
#> 5 PRK            25755441   367     7   4      9869.

# imputation does not work
recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |> 
  add_role(A, new_role = "impute") |> 
  step_impute_knn(
    c("B", "C"), 
    neighbors = 2, 
    impute_with = has_role("impute")
  ) |> 
  prep() |> 
  juice()
#> Warning: All predictors are missing; cannot impute
#> All predictors are missing; cannot impute
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3    13    6931.
#> 2 BGD           165516222  1136    NA    NA   35264.
#> 3 BRA           211782878  2463     5     5 8159001.
#> 4 CHN          1407745000  2951    NA    NA 8485748 
#> 5 PRK            25755441   367     7     4    9869.

Created on 2023-09-07 with reprex v2.0.2

Session info

``` r
sessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] recipes_1.0.8 dplyr_1.1.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.7.0        tidyselect_1.2.0    xfun_0.39          
#>  [4] purrr_1.0.1         listenv_0.9.0       splines_4.2.3      
#>  [7] lattice_0.20-45     vctrs_0.6.3         generics_0.1.3     
#> [10] htmltools_0.5.4     yaml_2.3.7          utf8_1.2.3         
#> [13] survival_3.5-3      prodlim_2023.08.28  rlang_1.1.1        
#> [16] R.oo_1.25.0         pillar_1.9.0        glue_1.6.2         
#> [19] withr_2.5.0         R.utils_2.12.0      R.cache_0.16.0     
#> [22] lifecycle_1.0.3     lava_1.7.2.1        timeDate_4022.108  
#> [25] R.methodsS3_1.8.2   future_1.33.0       codetools_0.2-19   
#> [28] evaluate_0.21       knitr_1.43          fastmap_1.1.1      
#> [31] parallel_4.2.3      class_7.3-21        fansi_1.0.4        
#> [34] Rcpp_1.0.10         ipred_0.9-14        parallelly_1.36.0  
#> [37] fs_1.6.2            digest_0.6.33       grid_4.2.3         
#> [40] hardhat_1.3.0       cli_3.6.1           tools_4.2.3        
#> [43] magrittr_2.0.3      tibble_3.2.1        future.apply_1.11.0
#> [46] pkgconfig_2.0.3     ellipsis_0.3.2      MASS_7.3-58.2      
#> [49] Matrix_1.5-3        data.table_1.14.8   timechange_0.2.0   
#> [52] lubridate_1.9.2     reprex_2.0.2        gower_1.0.1        
#> [55] rmarkdown_2.23      rstudioapi_0.15.0   R6_2.5.1           
#> [58] globals_0.16.2      rpart_4.1.19        nnet_7.3-18        
#> [61] compiler_4.2.3
```

EmilHvitfeldt commented 1 year ago

Hello @andreranza :wave: Thanks for the wonderful reprex!

As per the documentation for step_impute_knn(), you need to wrap selector functions such as has_role() in imp_vars() when passing them to impute_with. I would like has_role() to work directly in cases like this, but that is not yet implemented.

library(recipes)

df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)

recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |> 
  add_role(A, new_role = "impute") |> 
  step_impute_knn(
    c("B", "C"), 
    neighbors = 2, 
    impute_with = imp_vars(has_role("impute"))
  ) |> 
  prep() |> 
  juice()
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3  13      6931.
#> 2 BGD           165516222  1136     5   8.5   35264.
#> 3 BRA           211782878  2463     5   5   8159001.
#> 4 CHN          1407745000  2951     6   4.5 8485748 
#> 5 PRK            25755441   367     7   4      9869.

Created on 2023-09-07 with reprex v2.0.2
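
As a rough cross-check of the workaround above, the sketch below preps the same recipe twice: once with bare column names inside imp_vars() and once with imp_vars(has_role("impute")). It assumes a recipes version close to the 1.0.8 shown in the session info, where imp_vars() is the documented wrapper for impute_with. The object names rec_names, rec_role, imputed_names and imputed_role are made up for illustration and do not come from the thread.

```r
library(recipes)

df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)

# Variant 1: name the imputation predictors directly inside imp_vars()
rec_names <- recipe(GDP ~ ., data = df) |>
  step_impute_knn(
    B, C,
    neighbors = 2,
    impute_with = imp_vars(D, A)
  ) |>
  prep()

# Variant 2: tag the predictors with an extra role, then select by role
rec_role <- recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |>
  add_role(A, new_role = "impute") |>
  step_impute_knn(
    B, C,
    neighbors = 2,
    impute_with = imp_vars(has_role("impute"))
  ) |>
  prep()

imputed_names <- juice(rec_names)
imputed_role <- juice(rec_role)

# Both variants should leave no missing values in B and C
anyNA(imputed_names[, c("B", "C")])
anyNA(imputed_role[, c("B", "C")])
```

If the two anyNA() calls both return FALSE, the role-based selection reached the imputation step in the same way as the explicit column names did.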

andreranza commented 1 year ago

Wow, I definitely saw imp_vars(). Unsure why I didn't try that out 😅 I guess using has_role() without it felt so natural that I assumed it would work, despite what the documentation says. Sorry, and thanks a lot for pointing me in the right direction!