topepo / caret

caret (Classification And Regression Training) is an R package that contains miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

More optimistic results from caret::train model$results than from confusionMatrix(model$pred$pred, model$pred$obs) #897

Closed t31415 closed 4 years ago

t31415 commented 6 years ago

I noticed that the results for my models obtained from caret::train, as in

fitCtrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 5,
                        savePredictions = TRUE, classProbs = TRUE,
                        summaryFunction = BalAc)
model <- train(Y ~ ., data = data1, metric = 'OptBalAcc', method = 'qda',
               trControl = fitCtrl)
model$results

are much more optimistic than what I get from

confusionMatrix(model$pred$pred,model$pred$obs)

, where I had expected these to be the same, or at least for the latter (which I thought were the predictions made when the best model is applied again to all of the folds together) to be at least as good as the former (which I thought was the mean of the hold-out performances for each fold/repeat, and thus a little less subject to overfitting because the data are held out each time the fit takes place).

So I'm trying to understand how it's possible that confusionMatrix is always the more pessimistic one, when it is just the training error made when the same best model is used to predict all of the data again? Or is this not the case? Thanks

topepo commented 6 years ago

Providing a reproducible example and the results of sessionInfo will help get your question answered.

t31415 commented 6 years ago

Hi, I've tried to make a simplified example, but with the example dataset I found for imbalanced data I'm not able to reproduce the differences of up to 10% in balanced accuracy that I get on my own dataset. The dataset I was working with is rather big and a lot of code was indirectly involved in the custom metrics etc., so I put together a simpler case, where the difference turns out to be a lot smaller. Here is a minimal example:

install.packages('imbalance')
library(imbalance)   # for the newthyroid1 data
library(caret)
library(ROCR)        # prediction() and performance() used in the metric below

# Define a custom metric and its trainControl:
BalAc <- function(data, lev = NULL, model = NULL, b = NULL) {
  # the b argument is unused; it is only kept for compatibility with other summary functions
  rocr_pred <- prediction(data$positive, data$obs)
  sensvals  <- performance(rocr_pred, 'sens')@y.values[[1]]
  specvals  <- performance(rocr_pred, 'spec')@y.values[[1]]
  balacvals <- (sensvals + specvals) / 2
  # cutoff at which the balanced accuracy is maximal, and the corresponding value
  optcut <- performance(rocr_pred, 'sens')@x.values[[1]][which.max(balacvals)]
  optBA  <- balacvals[which.max(balacvals)]
  out <- c(optBA, optcut)
  names(out) <- c("OptBalAcc", "OptCutoff")
  out
}
# Wrapper so that extra arguments (such as b) can be passed on to the summary function:
trainControlSumfun0 <- function(method, number, repeats = NULL, sumfun, ...) {
  trainControl(method = method, number = number, repeats = repeats,
               savePredictions = 'final', classProbs = TRUE, sampling = 'smote',
               summaryFunction = function(data, lev = NULL, model = NULL) {
                 sumfun(data, lev, model, ...)
               })
}

fitControl <- trainControlSumfun0(method = 'repeatedcv', number = 10, repeats = 5,
                                  sumfun = BalAc, b = 2)

#Train a model
trainobj <- train(Class ~ ., data = newthyroid1,
                  method = 'mda', metric = 'OptBalAcc', maximize = TRUE,
                  trControl = fitControl, b = 2)

# I would think that the following two would be roughly the same, or that at least
# the former would be less optimistic than the latter, but often it is the other way round:
trainobj$results
confusionMatrix(trainobj$pred$pred,trainobj$pred$obs)

(Here the difference was only about 2%, but on my own dataset I often had much larger differences of up to 10%.) Does that difference mean something is wrong, or is it possible for the latter to be much worse than the former, and why is that? Am I right in assuming that the former is a CV error, i.e. the average over all hold-out errors, while the latter is a pure training error since it doesn't hold out the data it predicts, and should therefore yield a higher (apparent) performance due to more overfitting?

Thanks,

PS: the output of sessionInfo() is the following:

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  grid      stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] glmnet_2.0-16     Matrix_1.2-14     doParallel_1.0.11 iterators_1.0.9  
 [5] foreach_1.4.4     imbalance_1.0.0   logicFS_1.50.0    mcbiopi_1.1.2    
 [9] LogicReg_1.5.9    survival_2.41-3   kernlab_0.9-26    mlbench_2.1-1    
[13] DMwR_0.4.1        praznik_5.0.0     pryr_0.1.4        caret_6.0-79     
[17] ggplot2_2.2.1     lattice_0.20-35   ROCR_1.0-7        gplots_3.0.1     

loaded via a namespace (and not attached):
 [1] nlme_3.1-137       bitops_1.0-6       xts_0.10-2         lubridate_1.7.4   
 [5] dimRed_0.1.0       C50_0.1.1          tools_3.5.0        R6_2.2.2          
 [9] rpart_4.1-13       KernSmooth_2.23-15 lazyeval_0.2.1     colorspace_1.3-2  
[13] nnet_7.3-12        withr_2.1.2        gbm_2.1.3          tidyselect_0.2.4  
[17] mnormt_1.5-5       curl_3.2           compiler_3.5.0     mda_0.4-10        
[21] Cubist_0.2.1       caTools_1.17.1     scales_0.5.0       sfsmisc_1.1-2     
[25] DEoptimR_1.0-8     mvtnorm_1.0-7      psych_1.8.4        robustbase_0.93-0 
[29] stringr_1.3.1      foreign_0.8-70     pkgconfig_2.0.1    rlang_0.2.0       
[33] TTR_0.23-3         ddalpha_1.3.3      quantmod_0.4-13    bindr_0.1.1       
[37] zoo_1.8-1          gtools_3.5.0       dplyr_0.7.4        ModelMetrics_1.1.0
[41] magrittr_1.5       Formula_1.2-3      Rcpp_0.12.16       munsell_0.4.3     
[45] abind_1.4-5        partykit_1.2-1     stringi_1.2.2      inum_1.0-0        
[49] MASS_7.3-49        plyr_1.8.4         recipes_0.1.2      gdata_2.18.0      
[53] splines_3.5.0      pillar_1.2.2       reshape2_1.4.3     codetools_0.2-15  
[57] stats4_3.5.0       CVST_0.2-1         magic_1.5-8        glue_1.2.0        
[61] gtable_0.2.0       purrr_0.2.4        tidyr_0.8.0        assertthat_0.2.0  
[65] DRR_0.0.3          gower_0.1.2        prodlim_2018.04.18 libcoin_1.0-1     
[69] broom_0.4.4        e1071_1.6-8        class_7.3-14       geometry_0.3-6    
[73] timeDate_3043.102  smotefamily_1.2    RcppRoll_0.2.2     tibble_1.4.2      
[77] bindrcpp_0.2.2     lava_1.6.1         ipred_0.9-6       
t31415 commented 6 years ago

The cutoffs used apparently are in the same order after all, so I think the above would have been the correct balanced accuracy after all, and I really don't understand why it would be so much higher in the CV results than the one from confusionMatrix.

topepo commented 6 years ago

trainControl will not pass arguments into the summary function (and b isn't used inside of the function). To be honest, you've made this really more complex than it probably needs to be.

Does that difference mean something is wrong, or is it possible for the latter to be much less good than the former, and why is that?

So, to try to answer your question, the difference could really be driven by the dataset size. Since there is an imbalance, the number of events is probably pretty low so that the variability in the sensitivity is probably pretty high (so then the J statistic that you're calculating will have high variance too).

am I right in assuming that the former is a CV error, ie the average of all hold out errors, while the latter is a pure training error as it doesn't hold out the data it predicts and should therefore yield a higher (apparent) performance due to more overfitting?

train reports the average of the statistics computed inside the iterations of resampling, and your estimate using confusionMatrix uses the same predictions but pools the data. I would normally say that the latter approximates the former (but they would not be the same). I've done the same thing with the out-of-sample predictions to show an ROC curve that tries to "approximate" the curve being estimated by resampling.
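As a rough sketch (assuming a two-class train object, say trainobj, fit with savePredictions = 'final', and using the default class predictions rather than an optimized cutoff), the two quantities can be compared directly:

library(caret)
# statistic computed within each hold-out resample, then averaged (the kind of averaging train() does)
per_fold <- by(trainobj$pred, trainobj$pred$Resample, function(d)
  confusionMatrix(d$pred, d$obs)$byClass["Balanced Accuracy"])
mean(unlist(per_fold))
# the same hold-out predictions pooled into a single confusion matrix
confusionMatrix(trainobj$pred$pred, trainobj$pred$obs)$byClass["Balanced Accuracy"]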

I'm trying to understand how it's possible that confusionMatrix is always the more pessimistic one,

I would have thought that it would be more optimistic since it is pooling a lot of data together to make the estimate.

Try mocking up an example for testing. twoClassSim() has an option for the intercept and that can be used to modulate the event rate. Using a simulated data set would help because you could also simulate a very large test set to approximate the main component of error of the model.
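A minimal sketch of what that could look like (the lda model, the intercept value, and the data set sizes are just placeholders to illustrate the idea):

library(caret)
set.seed(1)
training <- twoClassSim(500, intercept = -12)       # the intercept controls the event rate
big_test <- twoClassSim(100000, intercept = -12)    # a large test set approximates the true error

ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 5,
                     classProbs = TRUE, savePredictions = 'final',
                     summaryFunction = twoClassSummary)
fit <- train(Class ~ ., data = training, method = 'lda',
             metric = 'ROC', trControl = ctrl)

fit$results                                              # resampled (averaged) estimates
confusionMatrix(fit$pred$pred, fit$pred$obs)             # pooled hold-out estimate
confusionMatrix(predict(fit, big_test), big_test$Class)  # large-sample estimate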

t31415 commented 6 years ago

Hi Max, thanks for your reply. I think there was actually another, much more basic problem with my code, and I believe I now understand what error (or at least one of the errors) I made in trying to use custom metrics. (I should have mentioned it here earlier, but I still wasn't 100% sure and I was in the middle of an exam session, so I didn't get around to it yet.) Maybe you can confirm whether the following makes sense as the reason for the discrepancy in the opposite direction (i.e. more pessimistic instead of optimistic), or whether it is still not correct:

You see, all of the custom metrics where I had noticed this problem were cases where I wanted to use an optimal cutoff value: for example, not the F-beta measure or balanced accuracy at a cutoff of 0.5, but at the cutoff where it becomes maximal. I think I overlooked the fact that caret will (probably?) be optimizing this cutoff separately in every repeat and for every hold-out fold, so it uses different cutoffs for different hold-out folds and repeats, which of course does not mimic realistic performance on a test set, where you can only use one single predetermined cutoff.

With the pooled predictions, on the other hand, I only optimize a single cutoff once, over the whole pooled set, so it of course yields a much lower (but much more realistic) performance. Do you think this explanation of the difference makes sense?

So what I should probably have done was to use a fixed cutoff in the summary function for the custom metric, and then afterwards run an external loop over a few of these cutoffs to select the value where the performance is best, no? I thought caret would do it like this, but in retrospect it was of course doing this within each hold-out fold and each repeat, instead of once for the whole set?
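Something like the following is what I have in mind now (just a rough sketch, not tested; the default cutoff of 0.5 and the "positive"/"negative" level names are assumptions):

BalAcFixed <- function(data, lev = NULL, model = NULL, cutoff = 0.5) {
  # classify with one predetermined cutoff instead of re-optimizing it in every resample
  pred_class <- factor(ifelse(data$positive >= cutoff, 'positive', 'negative'),
                       levels = levels(data$obs))
  sens <- caret::sensitivity(pred_class, data$obs, positive = 'positive')
  spec <- caret::specificity(pred_class, data$obs, negative = 'negative')
  c(BalAccFixed = (sens + spec) / 2)
}

Different cutoff values could then be passed through the trainControlSumfun0 wrapper above and their resampled performances compared.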

So I wonder a bit now if there would have been an easy way to include that cutoff as an extra tuning parameter in caret?

(Sorry for the poorly posed original question, by the way; it was the first time I was using the package (which is really incredibly useful, nice work!) and apparently I still didn't understand well enough how it worked.) Best regards and thanks again for your comments, Tim

topepo commented 6 years ago

About

So what I should probably have done was to use a fixed cutoff in the summary function for the custom metric, and then afterwards run an external loop over a few of these cutoffs to select the value where the performance is best, no?

and

So I wonder a bit now if there would have been an easy way to include that cutoff as an extra tuning parameter in caret?

I wrote an example of some custom code to optimize the threshold based on metrics during resampling. That method is a little complex though and gets more complex when the model uses the "sub-model trick".

Instead, I wrote a function called thresholder that can do the cutoff analysis after the model has been fit. That would probably work well for what you are trying to do (or, if not, you could adapt it).

I thought caret would do it like this, but in retrospect it was of course doing this within each hold-out fold and each repeat, instead of once for the whole set?

Yes, but you could pick the best threshold using resampling (and understand the variation around that value) then set it for the final model as we would with other tuning parameters.
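For a rough idea of the usage on your model above (assuming it was fit with classProbs = TRUE and the hold-out predictions saved; the threshold grid is arbitrary):

th <- thresholder(trainobj, threshold = seq(0.05, 0.95, by = 0.05), final = TRUE)
# pick the probability cutoff with the best resampled balanced accuracy
th$BalAcc <- (th$Sensitivity + th$Specificity) / 2
th[which.max(th$BalAcc), ]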

t31415 commented 6 years ago

Hi Max, that example and the thresholder function indeed seem to be doing exactly what I was referring to here, thanks a lot for your suggestions, and best regards!