topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Problem using caret train function #1040

Closed micpesce closed 5 years ago

micpesce commented 5 years ago

This is the issue I run into when calling caret's train function with knn:

library(caret)

df <- readRDS(gzcon(url("https://github.com/micpesce/german_credit/blob/master/gc_train.rds?raw=true")))
# In case the link does not work, the same file is in the attached df.zip

li <- which(names(df) == "credit_response")  # index of the label column
pred <- df[, -li]                            # predictors of the training set
outcome <- df$credit_response

fit_knn <- knn3(pred, outcome, k = 5)                         # works
train_rf <- train(pred, outcome, method = "rf", data = df)    # works
train_knn <- train(pred, outcome, method = "knn", data = df)  # fails with the following errors:

Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :3     NA's   :3    
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)

1: predictions failed for Resample01: k=5 Error in knn3Train(train = structure(c("A11", "A12", "A14", "A12", "A12", : unused argument (data = list(checking_account = c(1, 2, 4, 1, 4, 2, 4, 2, 1, 4, [... truncated]

Session Info:

>sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252   
[3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C                  
[5] LC_TIME=Italian_Italy.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tm_0.7-6        NLP_0.2-0       miscset_1.1.0   readtext_0.74   gdata_2.18.0   
 [6] caret_6.0-84    lattice_0.20-38 ggplot2_3.1.1   tidyr_0.8.3     stringr_1.4.0  
[11] dplyr_0.8.1    

loaded via a namespace (and not attached):
 [1] httr_1.4.0         pkgload_1.0.2      splines_3.6.0      foreach_1.4.4     
 [5] prodlim_2018.04.18 gtools_3.8.1       Formula_1.2-3      assertthat_0.2.1  
 [9] stats4_3.6.0       slam_0.1-45        remotes_2.0.4      sessioninfo_1.1.1 
[13] stabs_0.6-3        ipred_0.9-9        pillar_1.4.1       backports_1.1.4   
[17] glue_1.3.1         quadprog_1.5-7     mboost_2.9-1       digest_0.6.19     
[21] colorspace_1.4-1   recipes_0.1.5      Matrix_1.2-17      plyr_1.8.4        
[25] timeDate_3043.102  pkgconfig_2.0.2    devtools_2.0.2     naivebayes_0.9.5  
[29] xtable_1.8-4       purrr_0.3.2        mvtnorm_1.0-10     scales_1.0.0      
[33] processx_3.3.1     gower_0.2.1        lava_1.6.5         tibble_2.1.2      
[37] generics_0.0.2     usethis_1.5.0      withr_2.1.2        nnet_7.3-12       
[41] lazyeval_0.2.2     cli_1.1.0          survival_2.44-1.1  magrittr_1.5      
[45] crayon_1.3.4       memoise_1.1.0      ps_1.3.0           fs_1.3.1          
[49] fansi_0.4.0        nlme_3.1-139       MASS_7.3-51.4      xml2_1.2.0        
[53] class_7.3-15       pkgbuild_1.0.3     tools_3.6.0        data.table_1.12.2 
[57] prettyunits_1.0.2  kernlab_0.9-27     munsell_0.5.0      callr_3.2.0       
[61] compiler_3.6.0     e1071_1.7-1        inum_1.0-1         rlang_0.3.4       
[65] grid_3.6.0         iterators_1.0.10   partykit_1.2-4     gtable_0.3.0      
[69] ModelMetrics_1.2.2 codetools_0.2-16   reshape2_1.4.3     R6_2.4.0          
[73] gridExtra_2.3      nnls_1.4           lubridate_1.7.4    utf8_1.1.4        
[77] zeallot_0.1.0      libcoin_1.0-4      rprojroot_1.3-2    desc_1.2.0        
[81] stringi_1.4.3      parallel_3.6.0     Rcpp_1.0.1         import_1.1.0      
[85] vctrs_0.1.0        rpart_4.1-15       tidyselect_0.2.5   

[df.zip](https://github.com/topepo/caret/files/3240967/df.zip)

topepo commented 5 years ago

You have some non-numeric predictors:

> str(pred)
'data.frame':   700 obs. of  20 variables:
 $ checking_account : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 4 2 4 2 1 4 ...
 $ credit_duration  : num [1:700, 1] -1.236 2.247 -0.738 1.75 0.257 ...
 $ Credit_history   : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 3 3 3 5 3 5 ...
 $ purpose          : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 4 2 5 1 5 5 ...
 $ credit_amount    : num [1:700, 1] -0.745 0.949 -0.416 1.633 -0.155 ...
 $ savings_account  : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 3 1 4 1 2 5 ...
 $ employment_since : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 5 3 4 1 3 5 ...
 $ percentage_income: num [1:700, 1] 0.918 -0.8697 -0.8697 -0.8697 0.0241 ...
 $ personal_status  : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 1 4 2 3 ...
 $ other_guarantors : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
 $ residence        : num [1:700, 1] 1.046 -0.766 0.14 1.046 1.046 ...
 $ property         : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 2 3 1 3 3 2 ...
 $ age              : num [1:700, 1] 2.765 -1.191 1.183 0.831 1.534 ...
 $ other_plans      : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ housing          : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 2 1 2 2 2 2 ...
 $ existing_credits : num [1:700, 1] 1.027 -0.705 -0.705 -0.705 -0.705 ...
 $ job              : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 4 2 4 2 3 ...
 $ house_manteinant : num [1:700, 1] -0.428 -0.428 2.334 2.334 -0.428 ...
 $ telephone        : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 1 1 1 ...
 $ foreign_worker   : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...

The way to get around this is to use the formula method to train() so that these are converted to dummy variables. Also, I suggest centering and scaling the data so that the distance calculations are not skewed by the units of the predictors.
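A minimal sketch of that suggestion, assuming a small made-up data frame (`toy`) in place of the real training set; the column names mirror the ones above, but the values are random and the tuning settings are arbitrary. The formula interface expands the factor into dummy variables, and `preProcess` centers and scales the resulting columns within each resample. As a side note, the `unused argument (data = ...)` message suggests that, in the x/y interface, `data = df` may have been forwarded through `...` to the model code; in the formula interface `data` is a regular argument of `train()`.

```r
library(caret)

set.seed(1)

# Hypothetical toy data standing in for the German credit training set
toy <- data.frame(
  checking_account = factor(sample(c("A11", "A12", "A13", "A14"), 200, replace = TRUE)),
  credit_duration  = rnorm(200, mean = 20, sd = 12),
  credit_amount    = rnorm(200, mean = 3000, sd = 2500),
  credit_response  = factor(sample(c("good", "bad"), 200, replace = TRUE))
)

# Formula interface: factors become dummy variables before knn sees them
train_knn <- train(
  credit_response ~ .,
  data       = toy,
  method     = "knn",
  preProcess = c("center", "scale"),        # put predictors on a common scale
  tuneGrid   = data.frame(k = c(5, 7, 9)),  # arbitrary candidate values of k
  trControl  = trainControl(method = "cv", number = 5)
)

print(train_knn)
```

Because the toy outcome is random, the reported accuracy is meaningless; the point is only that the call completes and returns non-missing metrics.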

EDIT: I didn't see the attached data at first

micpesce commented 5 years ago

Thanks Max. Before training, I have now converted all variables except the categorical outcome to numeric and centered and scaled them. It finally works in the form "train(pred, outcome, ...)" rather than "train(outcome ~ ., data = training)", which is fine anyway. I think I was rather careless: KNN is best suited to datasets with truly numeric variables rather than unordered (non-hierarchical) factors. Alternatively, each categorical variable could be converted into k-1 binary variables, where k is its number of levels, and KNN applied afterwards.
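The k-1 binary encoding mentioned here can be sketched in base R with `model.matrix()`. This is only an illustration under assumptions: the `toy` data frame and its values are invented, and treatment contrasts (R's default for unordered factors) are assumed.

```r
# Hypothetical toy predictors: one factor with 3 levels, one numeric column
toy <- data.frame(
  housing = factor(c("A151", "A152", "A153", "A152")),
  age     = c(23, 45, 31, 52)
)

# With default treatment contrasts, a factor with k levels becomes
# k - 1 indicator columns; [, -1] drops the intercept column
dummies <- model.matrix(~ ., data = toy)[, -1]

# Center and scale so that no single column dominates the distances
scaled <- scale(dummies)

print(colnames(dummies))
```

The resulting all-numeric matrix can then be passed to the x/y form of `train()` or to `knn3()`. caret's `dummyVars()` with `fullRank = TRUE` is another way to get a similar encoding.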