tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

`step_bagimpute` fails with nzv character columns #209

Open glenrs opened 5 years ago

glenrs commented 5 years ago

step_bagimpute crashes when it is given a near-zero-variance (nzv) character column.

The following example crashes when only one value is in the character column, but works if 2 values are present. Numeric columns are not affected.

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: broom
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

d <- data.frame(let = c(rep("a", 99), NA), num = 1:100)
rec_obj <- d %>%
  recipes::recipe(formula = "~.") %>%
  recipes::step_bagimpute("let")

recipes::prep(rec_obj)
#> Error in cbind(yval2, yprob, nodeprob): number of rows of matrices must match (see arg 2)
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] recipes_0.1.3 broom_0.5.0   dplyr_0.7.6  
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.18       ddalpha_1.3.4      gower_0.1.2       
#>  [4] pillar_1.3.0       compiler_3.5.1     DEoptimR_1.0-8    
#>  [7] bindr_0.1.1        magic_1.5-8        class_7.3-14      
#> [10] tools_3.5.1        rpart_4.1-13       ipred_0.9-7       
#> [13] digest_0.6.16      lubridate_1.7.4    evaluate_0.11     
#> [16] tibble_1.4.2       nlme_3.1-137       lattice_0.20-35   
#> [19] pkgconfig_2.0.2    rlang_0.2.2        Matrix_1.2-14     
#> [22] yaml_2.2.0         RcppRoll_0.3.0     prodlim_2018.04.18
#> [25] bindrcpp_0.2.2     stringr_1.3.1      knitr_1.20        
#> [28] nnet_7.3-12        CVST_0.2-2         rprojroot_1.3-2   
#> [31] grid_3.5.1         tidyselect_0.2.4   glue_1.3.0        
#> [34] robustbase_0.93-2  R6_2.2.2           survival_2.42-6   
#> [37] rmarkdown_1.10     lava_1.6.3         kernlab_0.9-27    
#> [40] DRR_0.0.3          purrr_0.2.5        tidyr_0.8.1       
#> [43] magrittr_1.5       pls_2.7-0          splines_3.5.1     
#> [46] sfsmisc_1.1-2      backports_1.1.2    htmltools_0.3.6   
#> [49] MASS_7.3-50        dimRed_0.1.0       abind_1.4-5       
#> [52] assertthat_0.2.0   timeDate_3043.102  stringi_1.2.4     
#> [55] geometry_0.3-6     crayon_1.3.4

Created on 2018-10-02 by the reprex package (v0.2.0).

topepo commented 5 years ago

The issue is that the variable has a single value, so most models could not be used. The best we can do is to throw a more meaningful error.

Using step_modeimpute would be a better choice.
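For readers unfamiliar with mode imputation, here is a minimal base R sketch of the idea: each missing entry is replaced by the most frequent observed value. The helper `impute_mode()` is hypothetical and written only for illustration; it is not the recipes implementation of step_modeimpute.

```r
# Hypothetical helper (not part of recipes): fill missing entries of a
# nominal vector with its most frequent observed value.
impute_mode <- function(x) {
  counts <- table(x)                      # table() ignores NA by default
  if (length(counts) == 0) {
    return(x)                             # nothing observed, nothing to impute
  }
  mode_value <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_value
  x
}

# Same data as the first example in this issue
d <- data.frame(let = c(rep("a", 99), NA), num = 1:100,
                stringsAsFactors = FALSE)
d$let <- impute_mode(d$let)
anyNA(d$let)
#> [1] FALSE
```

With a single observed value there is nothing for a tree model to learn, so the mode is the only sensible fill-in for this column.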

glenrs commented 5 years ago

I agree that step_modeimpute would be a better choice in the circumstance above. A more meaningful error would be helpful. Thank you.

In wide datasets this could be a larger problem if one of the character columns has only one value. It would be a pity to fall back to mode imputation for all nominal columns because of one feature.

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: broom
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(healthcareai) ## This is included to provide pima_diabetes data
#> healthcareai version 2.2.0
#> Please visit https://docs.healthcare.ai for full documentation and vignettes. Join the community at https://healthcare-ai.slack.com

d <- data.frame(let = c(rep("a", 767), NA), num = 1:768, stringsAsFactors = FALSE)
d <- 
  d %>%
  cbind(pima_diabetes)

rec_obj <- 
  d %>%
  recipe(formula = "~.") %>%
  step_bagimpute(all_nominal())

prep(rec_obj)
#> Error in cbind(yval2, yprob, nodeprob): number of rows of matrices must match (see arg 2)

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] healthcareai_2.2.0 recipes_0.1.3      broom_0.5.0       
#> [4] dplyr_0.7.6       
#> 
#> loaded via a namespace (and not attached):
#>  [1] magic_1.5-8        ddalpha_1.3.4      tidyr_0.8.1       
#>  [4] sfsmisc_1.1-2      splines_3.5.1      foreach_1.4.4     
#>  [7] prodlim_2018.04.18 gtools_3.8.1       assertthat_0.2.0  
#> [10] stats4_3.5.1       DRR_0.0.3          yaml_2.2.0        
#> [13] robustbase_0.93-2  ipred_0.9-7        pillar_1.3.0      
#> [16] backports_1.1.2    lattice_0.20-35    glue_1.3.0        
#> [19] MLmetrics_1.1.1    digest_0.6.16      colorspace_1.3-2  
#> [22] cowplot_0.9.3      htmltools_0.3.6    Matrix_1.2-14     
#> [25] plyr_1.8.4         timeDate_3043.102  pkgconfig_2.0.2   
#> [28] CVST_0.2-2         caret_6.0-80       purrr_0.2.5       
#> [31] scales_1.0.0       ranger_0.10.1      gdata_2.18.0      
#> [34] gower_0.1.2        lava_1.6.3         tibble_1.4.2      
#> [37] ggplot2_3.0.0      xgboost_0.71.2     withr_2.1.2       
#> [40] ROCR_1.0-7         nnet_7.3-12        lazyeval_0.2.1    
#> [43] survival_2.42-6    magrittr_1.5       crayon_1.3.4      
#> [46] evaluate_0.11      nlme_3.1-137       MASS_7.3-50       
#> [49] gplots_3.0.1       dimRed_0.1.0       class_7.3-14      
#> [52] data.table_1.11.4  tools_3.5.1        stringr_1.3.1     
#> [55] kernlab_0.9-27     glmnet_2.0-16      munsell_0.5.0     
#> [58] bindrcpp_0.2.2     e1071_1.7-0        pls_2.7-0         
#> [61] compiler_3.5.1     RcppRoll_0.3.0     caTools_1.17.1.1  
#> [64] rlang_0.2.2        grid_3.5.1         iterators_1.0.10  
#> [67] bitops_1.0-6       rmarkdown_1.10     ModelMetrics_1.2.0
#> [70] geometry_0.3-6     gtable_0.2.0       codetools_0.2-15  
#> [73] DBI_1.0.0          abind_1.4-5        reshape2_1.4.3    
#> [76] R6_2.2.2           lubridate_1.7.4    knitr_1.20        
#> [79] bindr_0.1.1        rprojroot_1.3-2    KernSmooth_2.23-15
#> [82] stringi_1.2.4      Rcpp_0.12.18       rpart_4.1-13      
#> [85] dbplyr_1.2.2       DEoptimR_1.0-8     tidyselect_0.2.4

Created on 2018-10-04 by the reprex package (v0.2.0).
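
One possible way to avoid giving up bagged imputation for every nominal column in a wide dataset is to find the single-valued nominal columns first and handle only those separately. The sketch below uses base R only; the helper `single_valued_nominal()` and the extra column `grp` are hypothetical names introduced for illustration and are not part of recipes.

```r
# Hypothetical helper (not part of recipes): names of character/factor
# columns that have fewer than two distinct non-missing values.
single_valued_nominal <- function(data) {
  is_nominal <- vapply(
    data,
    function(col) is.character(col) || is.factor(col),
    logical(1)
  )
  n_distinct <- vapply(
    data,
    function(col) length(unique(col[!is.na(col)])),
    integer(1)
  )
  names(data)[is_nominal & n_distinct < 2]
}

d <- data.frame(
  let = c(rep("a", 767), NA),   # one observed value plus a missing entry
  grp = rep(c("x", "y"), 384),  # illustrative column with two values
  num = 1:768,
  stringsAsFactors = FALSE
)

single_valued_nominal(d)
#> [1] "let"
```

The returned names could then be mode-imputed or dropped, leaving the remaining nominal columns for step_bagimpute. How best to express that selection inside a recipe depends on the recipes version in use, so that part is left out of the sketch.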