tidymodels / themis

Extra recipes steps for dealing with unbalanced data
https://themis.tidymodels.org/

Error when running `step_bsmote` with a single predictor #151

Open koenniem opened 2 months ago

koenniem commented 2 months ago

The problem

When `step_bsmote()` is run with only a single predictor, it throws an error saying that a matrix cannot be created. The cause is in `themis:::bsmote_impl()` at line 19: the `data` argument for `smote_data()` is obtained by subsetting `data_mat` with the values of `min_class_in`, but because of how the pesky subset operator `[` works, a single-column matrix is simplified to a vector. As a result, running `step_bsmote()` with a single predictor always throws this error.

The fix is to specify `drop = FALSE` when subsetting `data_mat`, so that line 19 becomes:

tmp_df <- as.data.frame(
  smote_data(
    data = data_mat[min_class_in, , drop = FALSE],
    k = k,
    n_samples = samples_needed[i],
    smote_ids = which(danger_ids[min_class_in])
  )
)
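
For readers who want to see the subsetting behaviour in isolation, here is a minimal base R sketch. The objects reuse the names `data_mat` and `min_class_in` from the issue, but their contents are made up for illustration and are not the values themis produces internally.

```r
# Toy stand-ins for the objects named in the issue; the values are invented.
data_mat <- matrix(c(1.5, 2.0, 3.5, 4.0, 5.5), ncol = 1,
                   dimnames = list(NULL, "compounds"))
min_class_in <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Default subsetting drops the lone column dimension and returns a vector.
dropped <- data_mat[min_class_in, ]
is.matrix(dropped)  # FALSE
dim(dropped)        # NULL

# drop = FALSE keeps the result as a one-column matrix.
kept <- data_mat[min_class_in, , drop = FALSE]
is.matrix(kept)     # TRUE
dim(kept)           # 3 1
```

With two or more columns both forms return a matrix, which is presumably why the problem only appears with a single predictor.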

Reproducible example

library(tidymodels)
library(themis)

recipe(class ~ compounds, data = hpc_data) |> 
  step_bsmote(all_outcomes(), all_neighbors = FALSE) |> 
  prep() |> 
  bake(NULL)
#> Error in `step_bsmote()`:
#> Caused by error in `matrix()`:
#> ! non-numeric matrix extent

Created on 2024-08-14 with reprex v2.1.1
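
One plausible route from the dropped dimension to the `matrix()` error above is sketched below. It assumes that the downstream code sizes its output with `ncol()` of the data it receives; that is an assumption about the internals and is not confirmed in this issue.

```r
# A plain vector has no dimensions, so ncol() returns NULL rather than 1.
x <- c(1.5, 3.5, 4.0)
ncol(x)  # NULL

# Passing that NULL as a matrix extent makes matrix() fail;
# the condition message is captured here instead of stopping the script.
msg <- tryCatch(
  matrix(0, nrow = 2, ncol = ncol(x)),
  error = function(e) conditionMessage(e)
)
msg

# NCOL() treats a vector as a single column, so the same call succeeds.
ok_dim <- dim(matrix(0, nrow = 2, ncol = NCOL(x)))
ok_dim  # 2 1
```

Keeping the input a matrix with `drop = FALSE`, as proposed above, avoids the problem at the source instead of patching every later use of `ncol()`.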

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United Kingdom.utf8
#>  ctype    English_United Kingdom.utf8
#>  tz       Europe/Brussels
#>  date     2024-08-14
#>  pandoc   3.1.11 @ C:/Workdir/MyApps/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  backports      1.5.0      2024-05-23 [1] CRAN (R 4.4.0)
#>  broom        * 1.0.6      2024-05-17 [1] CRAN (R 4.4.0)
#>  class          7.3-22     2023-05-03 [1] CRAN (R 4.3.0)
#>  cli            3.6.3      2024-06-21 [1] CRAN (R 4.4.1)
#>  codetools      0.2-20     2024-03-31 [1] CRAN (R 4.3.3)
#>  colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.2.2)
#>  data.table     1.15.4     2024-03-30 [1] CRAN (R 4.3.3)
#>  dials        * 1.2.1      2024-02-22 [1] CRAN (R 4.3.2)
#>  DiceDesign     1.10       2023-12-07 [1] CRAN (R 4.3.2)
#>  digest         0.6.36     2024-06-23 [1] CRAN (R 4.4.1)
#>  dplyr        * 1.1.4      2023-11-17 [1] CRAN (R 4.3.2)
#>  evaluate       0.24.0     2024-06-10 [1] CRAN (R 4.4.1)
#>  fansi          1.0.6      2023-12-08 [1] CRAN (R 4.3.2)
#>  fastmap        1.2.0      2024-05-15 [1] CRAN (R 4.4.0)
#>  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.1.3)
#>  fs             1.6.4      2024-04-25 [1] CRAN (R 4.4.0)
#>  furrr          0.3.1      2022-08-15 [1] CRAN (R 4.2.1)
#>  future         1.33.2     2024-03-26 [1] CRAN (R 4.3.3)
#>  future.apply   1.11.2     2024-03-28 [1] CRAN (R 4.3.3)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
#>  ggplot2      * 3.5.1      2024-04-23 [1] CRAN (R 4.3.3)
#>  globals        0.16.3     2024-03-08 [1] CRAN (R 4.3.3)
#>  glue           1.7.0      2024-01-09 [1] CRAN (R 4.3.2)
#>  gower          1.0.1      2022-12-22 [1] CRAN (R 4.2.2)
#>  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.0.0)
#>  gtable         0.3.5      2024-04-22 [1] CRAN (R 4.3.3)
#>  hardhat        1.4.0      2024-06-02 [1] CRAN (R 4.4.1)
#>  htmltools      0.5.8.1    2024-04-04 [1] CRAN (R 4.3.3)
#>  infer        * 1.0.7      2024-03-25 [1] CRAN (R 4.3.3)
#>  ipred          0.9-15     2024-07-18 [1] CRAN (R 4.4.0)
#>  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.1.3)
#>  knitr          1.48       2024-07-07 [1] CRAN (R 4.4.1)
#>  lattice        0.22-6     2024-03-20 [1] CRAN (R 4.3.3)
#>  lava           1.8.0      2024-03-05 [1] CRAN (R 4.3.3)
#>  lhs            1.2.0      2024-06-30 [1] CRAN (R 4.4.1)
#>  lifecycle      1.0.4      2023-11-07 [1] CRAN (R 4.3.1)
#>  listenv        0.9.1      2024-01-29 [1] CRAN (R 4.3.2)
#>  lubridate      1.9.3      2023-09-27 [1] CRAN (R 4.3.2)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.1.3)
#>  MASS           7.3-61     2024-06-13 [1] CRAN (R 4.4.1)
#>  Matrix         1.7-0      2024-03-22 [1] CRAN (R 4.4.0)
#>  modeldata    * 1.4.0      2024-06-19 [1] CRAN (R 4.4.1)
#>  munsell        0.5.1      2024-04-01 [1] CRAN (R 4.3.3)
#>  nnet           7.3-19     2023-05-03 [1] CRAN (R 4.3.0)
#>  parallelly     1.37.1     2024-02-29 [1] CRAN (R 4.3.2)
#>  parsnip      * 1.2.1      2024-03-22 [1] CRAN (R 4.3.3)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.2.3)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.0.0)
#>  prodlim        2024.06.25 2024-06-24 [1] CRAN (R 4.4.1)
#>  purrr        * 1.0.2      2023-08-10 [1] CRAN (R 4.3.1)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.1.1)
#>  RANN           2.6.1      2019-01-08 [1] CRAN (R 4.0.0)
#>  Rcpp           1.0.13     2024-07-17 [1] CRAN (R 4.4.0)
#>  recipes      * 1.1.0      2024-07-04 [1] CRAN (R 4.4.1)
#>  reprex         2.1.1      2024-07-06 [1] CRAN (R 4.4.1)
#>  rlang          1.1.4      2024-06-04 [1] CRAN (R 4.4.1)
#>  rmarkdown      2.27       2024-05-17 [1] CRAN (R 4.4.0)
#>  ROSE           0.0-4      2021-06-14 [1] CRAN (R 4.3.3)
#>  rpart          4.1.23     2023-12-05 [1] CRAN (R 4.3.2)
#>  rsample      * 1.2.1      2024-03-25 [1] CRAN (R 4.3.3)
#>  rstudioapi     0.16.0     2024-03-24 [1] CRAN (R 4.3.3)
#>  scales       * 1.3.0      2023-11-28 [1] CRAN (R 4.3.2)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.1.2)
#>  survival       3.7-0      2024-06-05 [1] CRAN (R 4.4.1)
#>  themis       * 1.0.2      2023-08-14 [1] CRAN (R 4.3.3)
#>  tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.2.3)
#>  tidymodels   * 1.2.0      2024-03-25 [1] CRAN (R 4.3.3)
#>  tidyr        * 1.3.1      2024-01-24 [1] CRAN (R 4.3.2)
#>  tidyselect     1.2.1      2024-03-11 [1] CRAN (R 4.3.3)
#>  timechange     0.3.0      2024-01-18 [1] CRAN (R 4.3.2)
#>  timeDate       4032.109   2023-12-14 [1] CRAN (R 4.3.2)
#>  tune         * 1.2.1      2024-04-18 [1] CRAN (R 4.3.3)
#>  utf8           1.2.4      2023-10-22 [1] CRAN (R 4.3.2)
#>  vctrs          0.6.5      2023-12-01 [1] CRAN (R 4.3.2)
#>  withr          3.0.0      2024-01-16 [1] CRAN (R 4.3.2)
#>  workflows    * 1.1.4      2024-02-19 [1] CRAN (R 4.3.2)
#>  workflowsets * 1.1.0      2024-03-21 [1] CRAN (R 4.3.3)
#>  xfun           0.46       2024-07-18 [1] CRAN (R 4.4.0)
#>  yaml           2.3.9      2024-07-05 [1] CRAN (R 4.4.1)
#>  yardstick    * 1.3.1      2024-03-21 [1] CRAN (R 4.3.3)
#>
#>  [1] C:/Workdir/MyApps/R-Library/4.0
#>  [2] C:/Workdir/MyApps/R/R-4.4.1/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

EmilHvitfeldt commented 2 months ago

Thank you for reporting!