topepo / caret

caret (Classification And Regression Training) is an R package containing miscellaneous functions for training and plotting classification and regression models.
http://topepo.github.io/caret/index.html

Inconsistency between ROC value calculated by caret and pROC for resampled training data #944

Closed matifr closed 5 years ago

matifr commented 6 years ago

Hi, I want to generate ROC curves using the training data and resample results from the rfe function for the optimal subset size. I have managed to do this with the code below, but there is some inconsistency between the mean ROC value that caret reports and the one that I calculate with the pROC package, i.e. caret ROC = 0.8307 vs. pROC ROC = 0.8287.

I can't figure out why this is happening. Is this a bug in my code, or do the two packages calculate ROC in different ways?

On my own dataset (not shown here), the difference is bigger, i.e. caret ROC = 92.2% vs. pROC ROC = 89.15%.

Thanks a lot in advance! Matina

Minimal, runnable code:

library(mlbench)
library(caret)
library(pROC)
data(PimaIndiansDiabetes)

# Report ROC/Sens/Spec (instead of Accuracy/Kappa) during feature selection
rfFuncs$summary <- twoClassSummary
ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   number = 10,
                   repeats = 3,
                   verbose = TRUE,
                   saveDetails = TRUE,
                   returnResamp = "all")

trainctrl <- trainControl(classProbs= TRUE,
                          verboseIter = TRUE,
                          summaryFunction = twoClassSummary, 
                          method = "cv", 
                          number = 10 ,
                          returnResamp = "final", 
                          returnData = TRUE)
set.seed(12)
tunegrid <- expand.grid(mtry = 1:10)
rfe_rf <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8),
           method="rf",
           rfeControl = ctrl, 
           metric = "ROC", 
           trControl = trainctrl,
           tuneGrid = tunegrid,
           preProc = c("center", "scale"))
> rfe_rf

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 3 times) 

Resampling performance over subset size:

 Variables    ROC   Sens   Spec   ROCSD  SensSD  SpecSD Selected
         1 0.7256 0.8680 0.3914 0.05285 0.06025 0.11233         
         2 0.7695 0.8380 0.5523 0.05231 0.04649 0.09403         
         3 0.8009 0.8367 0.5759 0.04652 0.06036 0.07620         
         4 0.8111 0.8313 0.6096 0.04365 0.05698 0.07724         
         5 0.8196 0.8353 0.5958 0.04531 0.05551 0.08942         
         6 0.8258 0.8513 0.5997 0.04613 0.05399 0.09011         
         7 0.8274 0.8467 0.5897 0.04363 0.05973 0.09633         
         8 0.8307 0.8620 0.6132 0.04679 0.05235 0.10762        *

The top 5 variables (out of 8):
   glucose, mass, age, pregnant, insulin

# Keep only the held-out predictions made with the optimal subset size
selectedIndices <- rfe_rf$pred$Variables == rfe_rf$optsize
ROC <- plot.roc(rfe_rf$pred$obs[selectedIndices],
                rfe_rf$pred$neg[selectedIndices], legacy.axes = TRUE)

> ROC

Call:
plot.roc.default(x = rfe_rf$pred$obs[selectedIndices], predictor = rfe_rf$pred$neg[selectedIndices],     legacy.axes = TRUE)

Data: rfe_rf$pred$neg[selectedIndices] in 1500 controls (rfe_rf$pred$obs[selectedIndices] neg) > 804 cases (rfe_rf$pred$obs[selectedIndices] pos).
Area under the curve: 0.8287

I can reproduce the ROC given by caret by averaging the ROC values from each resample for the optimal subset.

mean(rfe_rf$resample$ROC[which(rfe_rf$resample$Variables == 8)])
[1] 0.830746
topepo commented 5 years ago

This call computes a single area under the ROC curve by pooling the held-out predictions across all resamples (2304 rows of data):

ROC = plot.roc(rfe_rf$pred$obs[selectedIndices],
               rfe_rf$pred$neg[selectedIndices], legacy.axes = TRUE)

This line takes the average of the 30 resampled area-under-the-ROC-curve estimates (10 folds × 3 repeats):

mean(rfe_rf$resample$ROC[which(rfe_rf$resample$Variables == 8)])

They should be "close" but would only be the same in rare circumstances (e.g. balanced data over the folds and a linear performance metric).
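The difference can be demonstrated without caret or pROC. Below is a minimal base-R sketch (synthetic data; the `auc` helper is a hypothetical stand-in that computes AUC via the Mann-Whitney rank statistic, which equals the area under the empirical ROC curve): with two folds of different class balance and score calibration, the pooled AUC and the mean of the per-fold AUCs disagree.

```r
# AUC via the Mann-Whitney rank statistic (equivalent to the area
# under the empirical ROC curve when there are no cross-class ties)
auc <- function(obs, prob) {
  r <- rank(prob)
  n_pos <- sum(obs == 1)
  n_neg <- sum(obs == 0)
  (sum(r[obs == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Two toy "folds" with different class balance and score calibration
obs1  <- c(1, 1, 1, 1, 0, 0)
prob1 <- c(0.9, 0.8, 0.7, 0.4, 0.5, 0.2)    # fold 1: mostly positives
obs2  <- c(1, 0, 0, 0, 0, 0)
prob2 <- c(0.45, 0.6, 0.3, 0.2, 0.1, 0.05)  # fold 2: mostly negatives

pooled   <- auc(c(obs1, obs2), c(prob1, prob2))          # one curve, all rows
averaged <- mean(c(auc(obs1, prob1), auc(obs2, prob2)))  # caret-style mean
c(pooled = pooled, averaged = averaged)
# pooled = 31/35 ≈ 0.8857, averaged = (0.875 + 0.8)/2 = 0.8375
```

Pooling weights each fold by its row count and mixes the folds' probability scales into one ranking, whereas caret's `twoClassSummary` computes one AUC per resample and `rfe` reports their mean, hence the small discrepancy seen above.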