topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Simulated annealing of a linear SVM model using ROC as metric fails #1322

Closed ning-y closed 1 year ago

ning-y commented 1 year ago

Minimal dataset:

library(tidyverse)
data(iris)

iris_2class <- iris %>%
  mutate(Species=ifelse(Species=="setosa", 1, 0))
iris_predictors <- iris_2class %>%
  select(-Species) %>% as.matrix()
iris_outcomes <- iris_2class %>%
  pull(Species)

Minimal, runnable code:

library(caret)
sa_ctrl <- safsControl(
  functions=caretSA, method="cv", number=10, p=0.75,
  metric=c(internal="Accuracy", external="Accuracy"),
  maximize=c(internal=TRUE, external=TRUE),
  improve=20, seeds=137+(0:10),
  returnResamp="all",
  verbose=TRUE)

tr_ctrl <- trainControl(classProbs=TRUE, summaryFunction=twoClassSummary)

safs(
  x=iris_predictors, y=iris_outcomes,
  method="svmLinear", metric="ROC", iters=5,
  safsControl=sa_ctrl, trControl=tr_ctrl)

Error output:

Error in { : 
  task 1 failed - "train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()"
Calls: safs -> safs.default -> %op% -> <Anonymous>
In addition: There were 21 warnings (use warnings() to see them)
Execution halted

Warning output:

1: In safs.default(x = iris_predictors, y = iris_outcomes,  ... :
  The metric 'Accuracy' is not created by the external summary function; 'RMSE' will be used instead
2: In train.default(x, y, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
3: In train.default(x, y, ...) :
  cannnot compute class probabilities for regression
4: In train.default(x, y, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
5: In train.default(x, y, ...) :
  cannnot compute class probabilities for regression
6: In train.default(x, y, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
7: In train.default(x, y, ...) :
  cannnot compute class probabilities for regression
8: In train.default(x, y, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
9: In train.default(x, y, ...) :
  cannnot compute class probabilities for regression
10: In train.default(x, y, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
11: In train.default(x, y, ...) :
  cannnot compute class probabilities for regression
12: In train.default(x, y, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
13: In train.default(x, y, ...) :
  cannnot compute class probabilities for regression
14: In train.default(x, y, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
15: In train.default(x, y, ...) :
  cannnot compute class probabilities for regression
16: In train.default(x, y, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
17: In train.default(x, y, ...) :
  cannnot compute class probabilities for regression
18: In train.default(x, y, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
19: In train.default(x, y, ...) :
  cannnot compute class probabilities for regression
20: In train.default(x, y, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
21: In train.default(x, y, ...) :
  cannnot compute class probabilities for regression

Session Info:

>sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS/LAPACK: /home/molecularonco/miniconda3/envs/caret/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_SG.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_SG.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_SG.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_SG.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.2   stringr_1.5.0   dplyr_1.0.10    purrr_0.3.5    
 [5] readr_2.1.3     tidyr_1.2.1     tibble_3.1.8    tidyverse_1.3.2
 [9] caret_6.0-93    lattice_0.20-45 ggplot2_3.4.0   devtools_2.4.5 
[13] usethis_2.1.6  

loaded via a namespace (and not attached):
  [1] nlme_3.1-160         fs_1.5.2             lubridate_1.9.0     
  [4] httr_1.4.4           rprojroot_2.0.3      backports_1.4.1     
  [7] tools_4.2.2          profvis_0.3.7        utf8_1.2.2          
 [10] R6_2.5.1             rpart_4.1.19         DBI_1.1.3           
 [13] colorspace_2.0-3     nnet_7.3-18          urlchecker_1.0.1    
 [16] withr_2.5.0          tidyselect_1.2.0     prettyunits_1.1.1   
 [19] processx_3.8.0       curl_4.3.3           compiler_4.2.2      
 [22] rvest_1.0.3          cli_3.4.1            xml2_1.3.3          
 [25] desc_1.4.2           scales_1.2.1         callr_3.7.3         
 [28] digest_0.6.30        pkgconfig_2.0.3      htmltools_0.5.3     
 [31] parallelly_1.32.1    sessioninfo_1.2.2    dbplyr_2.2.1        
 [34] fastmap_1.1.0        readxl_1.4.1         htmlwidgets_1.5.4   
 [37] rlang_1.0.6          shiny_1.7.3          generics_0.1.3      
 [40] jsonlite_1.8.3       googlesheets4_1.0.1  ModelMetrics_1.2.2.2
 [43] magrittr_2.0.3       Matrix_1.5-3         Rcpp_1.0.9          
 [46] munsell_0.5.0        fansi_1.0.3          lifecycle_1.0.3     
 [49] stringi_1.7.8        pROC_1.18.0          MASS_7.3-58.1       
 [52] pkgbuild_1.4.0       plyr_1.8.8           recipes_1.0.3       
 [55] grid_4.2.2           parallel_4.2.2       listenv_0.8.0       
 [58] promises_1.2.0.1     crayon_1.5.2         miniUI_0.1.1.1      
 [61] haven_2.5.1          splines_4.2.2        hms_1.1.2           
 [64] ps_1.7.2             pillar_1.8.1         future.apply_1.10.0 
 [67] reshape2_1.4.4       codetools_0.2-18     stats4_4.2.2        
 [70] pkgload_1.3.2        reprex_2.0.2         glue_1.6.2          
 [73] modelr_0.1.10        data.table_1.14.6    remotes_2.4.2       
 [76] tzdb_0.3.0           vctrs_0.5.1          httpuv_1.6.6        
 [79] foreach_1.5.2        cellranger_1.1.0     gtable_0.3.1        
 [82] future_1.29.0        assertthat_0.2.1     cachem_1.0.6        
 [85] gower_1.0.0          mime_0.12            prodlim_2019.11.13  
 [88] xtable_1.8-4         broom_1.0.1          later_1.3.0         
 [91] googledrive_2.0.0    class_7.3-20         survival_3.4-0      
 [94] gargle_1.2.1         timeDate_4021.106    iterators_1.0.14    
 [97] memoise_2.0.1        hardhat_1.2.0        lava_1.7.0          
[100] timechange_0.1.1     globals_0.16.2       ellipsis_0.3.2      
[103] ipred_0.9-13

ning-y commented 1 year ago

I have cross-posted this on StackExchange's Cross Validated.

ning-y commented 1 year ago

Using the same example as above, if I change iris_outcomes into a factor, I get a different error.

Change iris_outcomes into a factor:

iris_outcomes <- ifelse(iris_outcomes==1, "A", "B") %>% as.factor()
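
For reference, here is a minimal base-R sketch of the same conversion. It builds the factor in one step and checks the two properties that matter here: the outcome is a two-level factor, as the warnings above ask for, and its level names are valid R names, since caret uses them as column names for the class probabilities. The name `outcome` is only illustrative and is not part of the original example.

```r
# Illustrative sketch, base R only; `outcome` is a hypothetical name.
data(iris)

# Setosa becomes "A", everything else "B"; "A" is listed first so it is
# the first factor level.
outcome <- factor(
  ifelse(iris$Species == "setosa", "A", "B"),
  levels = c("A", "B"))

# The outcome should be a factor with exactly two levels ...
stopifnot(is.factor(outcome), nlevels(outcome) == 2)

# ... and the level names should already be valid R names.
stopifnot(identical(levels(outcome), make.names(levels(outcome))))

table(outcome)
```

A factor with levels "0" and "1" would fail the second check, because make.names("0") returns "X0". Note also that twoClassSummary treats the first factor level as the event of interest.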

A different error appears:

> safs(
  x=iris_predictors, y=iris_outcomes,
  method="svmLinear", metric="ROC", iters=5,
  safsControl=sa_ctrl, trControl=tr_ctrl)
Fold01 1 NA (1)
Fold01 2 NA->NA (1+1, 50.0%)Fold02 1 NA (1)
Fold02 2 NA->NA (1+1, 50.0%)Fold03 1 NA (1)
Fold03 2 NA->NA (1+1, 50.0%)Fold04 1 NA (1)
Fold04 2 NA->NA (1+1, 50.0%)Fold05 1 NA (1)
Fold05 2 NA->NA (1+0, 0.0%)Fold06 1 NA (1)
Fold06 2 NA->NA (1+1, 50.0%)Fold07 1 NA (1)
Fold07 2 NA->NA (1+0, 0.0%)Fold08 1 NA (1)
Fold08 2 NA->NA (1+0, 100.0%)Fold09 1 NA (1)
Fold09 2 NA->NA (1+1, 50.0%)Fold10 1 NA (1)
Fold10 2 NA->NA (1+1, 50.0%)Error in { : task 1 failed - "missing value where TRUE/FALSE needed"
> traceback()
5: stop(simpleError(msg, call = expr))
4: e$fun(obj, substitute(ex), parent.frame(), e$data)
3: foreach(i = seq(along = safsControl$index), .combine = "c", .verbose = FALSE, 
       .errorhandling = "stop") %op% {
       sa_select(x[safsControl$index[[i]], , drop = FALSE], y[safsControl$index[[i]]], 
           funcs = safsControl$functions, sa_metric = safsControl$metric, 
           sa_maximize = safsControl$maximize, iters = iters, sa_verbose = safsControl$verbose, 
           testX = x[safsControl$indexOut[[i]], , drop = FALSE], 
           testY = y[safsControl$indexOut[[i]]], sa_seed = safsControl$seeds[i], 
           improve = safsControl$improve, Resample = names(safsControl$index)[i], 
           holdout = safsControl$holdout, lvl = classLevels, ...)
   }
2: safs.default(x = iris_predictors, y = iris_outcomes, method = "svmLinear", 
       metric = "ROC", iters = 5, safsControl = sa_ctrl, trControl = tr_ctrl)
1: safs(x = iris_predictors, y = iris_outcomes, method = "svmLinear", 
       metric = "ROC", iters = 5, safsControl = sa_ctrl, trControl = tr_ctrl)

ning-y commented 1 year ago

My bad! This error comes from a mismatch between the metric argument passed to safs and the metric set in its safsControl argument.

I had written:

sa_ctrl <- safsControl(
  functions=caretSA, method="cv", number=10, p=0.75,
  # Oops!
  metric=c(internal="Accuracy", external="Accuracy"),
  maximize=c(internal=TRUE, external=TRUE),
  improve=20, seeds=137+(0:10),
  returnResamp="all",
  verbose=TRUE)
safs(
  x=iris_predictors, y=iris_outcomes,
  method="svmLinear", metric="ROC", iters=5,
  safsControl=sa_ctrl, trControl=tr_ctrl)

But clearly the safsControl call should have metric=c(internal="ROC", external="ROC") instead. Making this edit fixed the error.
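
One way to picture the mismatch is as a name lookup: each metric named in safsControl has to be one of the values that the corresponding summary function returns. The sketch below is base R only. The helper `check_sa_metrics` is hypothetical and not part of caret. The name vectors are written out by hand on the assumption that twoClassSummary returns ROC, Sens and Spec, and that defaultSummary returns Accuracy and Kappa for a factor outcome.

```r
# Hypothetical helper, not part of caret: checks whether each metric
# requested in safsControl is among the names a summary function returns.
check_sa_metrics <- function(sa_metric, internal_names, external_names) {
  c(internal = sa_metric[["internal"]] %in% internal_names,
    external = sa_metric[["external"]] %in% external_names)
}

# Assumed output names of the two summary functions (written by hand).
two_class_names <- c("ROC", "Sens", "Spec")   # twoClassSummary
default_names   <- c("Accuracy", "Kappa")     # defaultSummary, factor outcome

# Original setup: train() uses twoClassSummary internally, while the
# external fitness still uses the default summary.
original <- check_sa_metrics(
  c(internal = "Accuracy", external = "Accuracy"),
  internal_names = two_class_names,
  external_names = default_names)

# After switching both metrics to ROC, with the external summary unchanged.
roc_only <- check_sa_metrics(
  c(internal = "ROC", external = "ROC"),
  internal_names = two_class_names,
  external_names = default_names)

print(original)
print(roc_only)
```

Under these assumptions, the original setup fails the internal lookup, which is consistent with the NA values in the verbose output above. The ROC setup passes the internal lookup but fails the external one.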

But this only gives ROC for the internal performance metric. To enable ROC for the external performance metric as well, I had to assign twoClassSummary to sa_ctrl$functions$fitness_extern:

sa_ctrl$functions$fitness_extern <- twoClassSummary

Making these two changes solved the issue.
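
For anyone landing here later, the pieces of this thread can be put together roughly as follows. This is a sketch assembled from the snippets above, not a verified script: it assumes caret and kernlab are installed, it uses base R instead of tidyverse for the data preparation, and the names `x`, `y` and `fit` are illustrative.

```r
library(caret)

# Two-class version of iris: predictors as a matrix, outcome as a factor
# with valid level names ("A" = setosa, "B" = everything else).
data(iris)
x <- as.matrix(iris[, 1:4])
y <- factor(ifelse(iris$Species == "setosa", "A", "B"), levels = c("A", "B"))

# Change 1: ask for ROC as both the internal and the external metric.
sa_ctrl <- safsControl(
  functions = caretSA, method = "cv", number = 10,
  metric = c(internal = "ROC", external = "ROC"),
  maximize = c(internal = TRUE, external = TRUE),
  improve = 20, seeds = 137 + (0:10),
  returnResamp = "all",
  verbose = TRUE)

# Change 2: make the external fitness function produce ROC as well.
sa_ctrl$functions$fitness_extern <- twoClassSummary

# train() needs class probabilities and a summary function that reports ROC.
tr_ctrl <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)

set.seed(1)  # the resampling folds are created inside safs()
fit <- safs(
  x = x, y = y,
  method = "svmLinear", metric = "ROC", iters = 5,
  safsControl = sa_ctrl, trControl = tr_ctrl)
```

The only differences from the failing call are the factor outcome, the ROC entries in metric, and the replaced fitness_extern function.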