topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Various problems in multi-class classification problem #726

Closed ogreyesp closed 7 years ago

ogreyesp commented 7 years ago

Hi,

I have a multi-class problem and I want to run a multiple comparison of several learning algorithms on it. I run into problems with several of the methods. This is my minimal reproducible example:

library(caret)

library(doMC)
registerDoMC(cores = 4)

data("iris")
set.seed(825)

To create a stratified repeated k-fold cross validation

multiIndexes<-createMultiFolds(y=iris$Species, k = 10, times = 3)

fitControl <- trainControl(method="repeatedcv", number=10, repeats = 1, index = multiIndexes, classProbs=TRUE, savePredictions = TRUE, search="grid", allowParallel= TRUE, summaryFunction = multiClassSummary, verboseIter = FALSE)

execute the algorithm

modelFit <- train(Species ~ ., data = iris, method= "rf", metric = "AUC", maximize = TRUE, tuneLength = 5, trControl = fitControl)

This is my session info:


R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu Artful Aardvark (development branch)

Matrix products: default
BLAS: /home/oscar/anaconda3/lib/R/lib/libRblas.so
LAPACK: /home/oscar/anaconda3/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=es_ES.UTF-8 LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] randomForest_4.6-12 doMC_1.3.4 iterators_1.0.8 foreach_1.4.3 caret_6.0-76 ggplot2_2.2.1
[7] lattice_0.20-35

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11 magrittr_1.5 splines_3.4.1 MASS_7.3-47 munsell_0.4.3 colorspace_1.3-2
 [7] rlang_0.1.1 minqa_1.2.4 stringr_1.2.0 car_2.1-4 plyr_1.8.4 tools_3.4.1
[13] nnet_7.3-12 pbkrtest_0.4-7 grid_3.4.1 gtable_0.2.0 nlme_3.1-131 mgcv_1.8-17
[19] quantreg_5.33 e1071_1.6-8 class_7.3-14 MatrixModels_0.4-1 lme4_1.1-13 lazyeval_0.2.0
[25] tibble_1.3.3 Matrix_1.2-10 nloptr_1.0.4 reshape2_1.4.2 ModelMetrics_1.1.0 codetools_0.2-15
[31] stringi_1.1.5 compiler_3.4.1 scales_0.4.1 stats4_3.4.1 SparseM_1.77

After running this MRE with several methods, the results are as follows:

I would appreciate your help.

Thanks in advance,

Oscar

topepo commented 7 years ago

Some notes:

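* The resampling is specified in two conflicting ways, `createMultiFolds(y=iris$Species, k = 10, times = 3)` and `trainControl(method="repeatedcv", number=10, repeats = 1, ...)`: one with three replicates, another with a single replicate.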
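
One way to keep the two specifications in agreement is to define the number of folds and repeats once and reuse them in both calls. This is only a sketch: `n_folds` and `n_repeats` are illustrative names that are not in the original post, and the code builds the control object without fitting any model.

```r
library(caret)

data("iris")
set.seed(825)

# Illustrative names (not from the original post): set once, reuse below
n_folds   <- 10
n_repeats <- 3

# Stratified training-set indices: one list element per fold per repeat
multiIndexes <- createMultiFolds(y = iris$Species, k = n_folds, times = n_repeats)

# The same values go into trainControl, so the description of the
# resampling scheme matches the indices that are actually supplied
fitControl <- trainControl(
  method = "repeatedcv",
  number = n_folds,
  repeats = n_repeats,
  index = multiIndexes,
  classProbs = TRUE,
  savePredictions = TRUE,
  summaryFunction = multiClassSummary
)

# Number of resamples that will be used (folds times repeats)
length(multiIndexes)
```

If the custom folds are not needed, dropping `index` and letting `trainControl` create the resamples from `number` and `repeats` should avoid the mismatch as well.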

* `ordinalNet` is for ordinal data. 

* For `glmboost`, you may have noticed the error message: `response is not a factor at two levels but ‘family = Binomial()’`. Basically, it does not do 3-class problems. This is also true for several other models in your list (e.g. the oblique random forest methods). See [the list of two-class models](https://topepo.github.io/caret/train-models-by-tag.html#Two_Class_Only). I can put more specific error checks in (although this reduces computational efficiency).
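
Before launching a long comparison, the candidate methods can be screened against the model tags that ship with caret. The sketch below assumes that the tag string is exactly `"Two Class Only"` (as in the tag page linked above); the `candidates` vector is a hypothetical subset chosen for illustration, not the full list from the original post.

```r
library(caret)

# Metadata for every model caret knows about, as a named list
info <- getModelInfo()

# Names of models carrying the "Two Class Only" tag
# (tag string assumed to match the documentation page)
is_two_class_only <- vapply(
  info,
  function(m) "Two Class Only" %in% m$tags,
  logical(1)
)
two_class_only <- names(info)[is_two_class_only]

# Hypothetical subset of methods to compare
candidates <- c("rf", "glmboost")

# Methods that can be kept for a problem with three or more classes
usable <- setdiff(candidates, two_class_only)
usable
```

This only screens out the two-class restriction; a model such as `ordinalNet` would still need to be excluded by hand because its limitation is about the kind of outcome, not the number of classes.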

I'm going to close this. The reproducible examples are very helpful but please submit separate issues for specific models once you have thoroughly made sure that the issue is with the software.