topepo / C5.0

An R package for fitting Quinlan's C5.0 classification model
https://topepo.github.io/C5.0/

C5.0 fails with commas in input variables #12

Open CarolynOlsen opened 6 years ago

CarolynOlsen commented 6 years ago

C5.0() now fails on factor variables that include commas, where it did not before.

I recently updated my version of C50, and tried to train a model on a data set I've trained C5.0 models on before. I now receive the error "c50 code called exit with value 1". I narrowed it down to one factor variable that had commas in the values. After removing the commas, the model trained fine. Below is a small example I created to replicate the problem.

Thank you very much!

> ## PURPOSE: Replicate an error in C5.0 model training with commas
> 
> # define 2 different data frames, one with commas
> 
> # df no commas
> v1 = c(2, 3, 5, 7, 2, 4, 5, 2) 
> v2 = c("aa", "bb", "cc", "dd", "aa", "bb", "aa", "bb") 
> v3 = factor(c(1, 0, 0, 0, 1, 0, 1, 1) )
> dfNoCommas = data.frame(v1, v2, v3)
> 
> # df with commas
> v1 = c(2, 3, 5, 7, 2, 4, 5, 2) 
> v2 = c("a,a", "b,b", "c,c", "d,d", "a,a", "b,b", "a,a", "b,b") 
> v3 = factor(c(1, 0, 0, 0, 1, 0, 1, 1) )
> dfCommas = data.frame(v1, v2, v3)
> 
> # load C5 library
> library(C50)
> 
> # train a model with the no commas df
> trainNoCommas <- C5.0(formula = v3 ~ .
+      , data = dfNoCommas[,!colnames(dfNoCommas) %in% c("v3")]
+      , trials = 1
+      , rules = TRUE
+      , control = C5.0Control()
+ )
> 
> # train a model with the commas df
> trainCommas <- C5.0(formula = v3 ~ .
+                     , data = dfCommas[,!colnames(dfCommas) %in% c("v3")]
+                     , trials = 1
+                     , rules = TRUE
+                     , control = C5.0Control()
+ )
c50 code called exit with value 1
> 
> # see package versions
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] grid      parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RODBC_1.3-15      C50_0.1.1         AUC_0.3.0         adabag_4.2        pROC_1.11.0       smbinning_0.6     Formula_1.2-2     partykit_1.2-0   
 [9] rpart_4.1-11      mvtnorm_1.0-7     libcoin_1.0-1     sqldf_0.4-11      RSQLite_2.1.0     gsubfn_0.7        proto_1.0.0       stringr_1.3.0    
[17] caret_6.0-79      ggplot2_2.2.1     lattice_0.20-35   doParallel_1.0.11 iterators_1.0.9   foreach_1.4.4    

loaded via a namespace (and not attached):
[1] nlme_3.1-131        lubridate_1.7.2     bit64_0.9-7         dimRed_0.1.0        tools_3.4.3         R6_2.2.2            DBI_0.8            
 [8] lazyeval_0.2.1      colorspace_1.3-2    nnet_7.3-12         withr_2.1.1         tidyselect_0.2.4    mnormt_1.5-5        bit_1.1-12         
[15] compiler_3.4.3      chron_2.3-52        Cubist_0.2.1        scales_0.5.0        sfsmisc_1.1-2       DEoptimR_1.0-8      psych_1.7.8        
[22] robustbase_0.92-8   digest_0.6.15       foreign_0.8-69      pkgconfig_2.0.1     rlang_0.2.0         ddalpha_1.3.2       bindr_0.1          
[29] dplyr_0.7.4         ModelMetrics_1.1.0  magrittr_1.5        Matrix_1.2-12       Rcpp_0.12.15        munsell_0.4.3       abind_1.4-5        
[36] stringi_1.1.6       inum_1.0-0          MASS_7.3-47         plyr_1.8.4          recipes_0.1.2       blob_1.1.1          splines_3.4.3      
[43] pillar_1.2.1        tcltk_3.4.3         xgboost_0.6.4.1     reshape2_1.4.3      codetools_0.2-15    stats4_3.4.3        CVST_0.2-1         
[50] magic_1.5-8         glue_1.2.0          data.table_1.10.4-3 gtable_0.2.0        purrr_0.2.4         tidyr_0.8.0         kernlab_0.9-25     
[57] assertthat_0.2.0    DRR_0.0.3           gower_0.1.2         prodlim_1.6.1       broom_0.4.3         class_7.3-14        survival_2.41-3    
[64] geometry_0.3-6      timeDate_3043.102   RcppRoll_0.2.2      tibble_1.4.2        memoise_1.1.0       bindrcpp_0.2        lava_1.6.1         
[71] ipred_0.9-6     

topepo commented 6 years ago

This looks like a limitation in the C5.0 C code. You can escape other characters, but I've been testing a bit and it doesn't accept a comma inside the data values, even when escaped.

You might dummy up some application files to verify. If it doesn't work, I'd email RuleQuest and see if Quinlan can make a change.
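
Until the underlying C code accepts commas, the workaround described in the original report (removing the commas before training) can be sketched in base R. This is only an illustration: `strip_commas` is a hypothetical helper, not part of C50, and replacing commas with an underscore is an assumption that may not suit every data set.

```r
# Hypothetical helper (not part of C50): replace commas in factor levels
# and character columns so the values no longer contain the separator.
strip_commas <- function(df, replacement = "_") {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.factor(col)) {
      # Only the level labels are rewritten; levels that become
      # identical after replacement are merged by `levels<-`.
      levels(col) <- gsub(",", replacement, levels(col), fixed = TRUE)
    } else if (is.character(col)) {
      col <- gsub(",", replacement, col, fixed = TRUE)
    }
    df[[nm]] <- col
  }
  df
}

# Same toy data as in the report, built as one data frame.
dfCommas <- data.frame(
  v1 = c(2, 3, 5, 7, 2, 4, 5, 2),
  v2 = c("a,a", "b,b", "c,c", "d,d", "a,a", "b,b", "a,a", "b,b"),
  v3 = factor(c(1, 0, 0, 0, 1, 0, 1, 1)),
  stringsAsFactors = TRUE
)

dfClean <- strip_commas(dfCommas)
print(levels(dfClean$v2))

# Fit only if C50 is installed; the cleaned data has no commas left.
if (requireNamespace("C50", quietly = TRUE)) {
  fit <- C50::C5.0(v3 ~ ., data = dfClean, trials = 1, rules = TRUE)
}
```

Any new data passed to `predict()` would need the same cleaning step, otherwise its factor levels would not match those seen during training.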

jjalcolea commented 5 years ago

Same problem here: I had no problem before, but after upgrading, commas in variables break the training process :-( Will check escaping the commas and report back... (EDITED): Sorry, I don't have time... I've downgraded with install_version("C50", version = "0.1.0-24", repos = "http://cran.us.r-project.org") to get the old comma-tolerant functionality...
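
For anyone weighing the same downgrade, a rough sketch of comparing C50 versions is below. It relies only on what this thread reports: 0.1.0-24 tolerated commas, and 0.1.1 is the version in the failing session. Where exactly the behavior changed between those two is not established here, and `is_newer_than_tolerant` is a hypothetical helper name. The `install_version()` call is assumed to be the one from the remotes package (devtools exposes it as well).

```r
# Version reported in this thread as still tolerating commas.
last_tolerant <- package_version("0.1.0-24")

# Hypothetical helper: TRUE when the given C50 version is newer than the
# last version reported here as comma-tolerant.
is_newer_than_tolerant <- function(version) {
  package_version(version) > last_tolerant
}

print(is_newer_than_tolerant("0.1.1"))     # version from the failing session
print(is_newer_than_tolerant("0.1.0-24"))  # version the commenter downgraded to

# To check the locally installed version instead (requires C50):
# is_newer_than_tolerant(as.character(packageVersion("C50")))

# Downgrade as described above (needs network access and the remotes package):
# remotes::install_version("C50", version = "0.1.0-24",
#                          repos = "http://cran.us.r-project.org")
```

Installing an archived version typically builds from source, and C50 contains C code, so a working compiler toolchain (Rtools on Windows) is likely needed.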