Closed t31415 closed 4 years ago
Providing a reproducible example and the results of sessionInfo
will help get your question answered.
Hi, I've tried to make a simplified example, but with the example dataset I found for imbalanced data I'm not able to reproduce the ~10% difference in balanced accuracy that I see on my own dataset. The dataset I was working with is rather large, and a lot of code was indirectly involved in the custom metrics etc., so I tried to come up with a simple example; here the difference is much smaller. Here is a minimal example:
install.packages('imbalance')
library(imbalance)
library(caret)
library(ROCR)  # for prediction() and performance()

# Define a custom metric and its trainControl:
BalAc <- function(data, lev = NULL, model = NULL, b = NULL) {
  # just for compatibility with other metrics we keep an unused b = .. argument
  sensvals <- performance(prediction(data$positive, data$obs), 'sens')@y.values[[1]]
  specvals <- performance(prediction(data$positive, data$obs), 'spec')@y.values[[1]]
  balacvals <- (sensvals + specvals) / 2
  optcut <- performance(prediction(data$positive, data$obs), 'sens')@x.values[[1]][which.max(balacvals)]
  optBA <- balacvals[which.max(balacvals)]
  out <- c(optBA, optcut)
  names(out) <- c("OptBalAcc", "OptCutoff")
  out
}

trainControlSumfun0 <- function(method, number, repeats = NULL, sumfun, ...) {
  trainControl(method = method,
               number = number, repeats = repeats, savePredictions = 'final',
               classProbs = TRUE, sampling = 'smote',
               summaryFunction = function(data, lev = NULL, model = NULL) sumfun(data, lev, model, ...))
}

fitControl <- trainControlSumfun0(method = 'repeatedcv', number = 10, repeats = 5,
                                  sumfun = BalAc, b = 2)
# Train a model
trainobj <- train(Class ~ ., data = newthyroid1,
                  method = 'mda', metric = 'OptBalAcc', maximize = TRUE,
                  trControl = fitControl, b = 2)

# I would expect the following two to be roughly the same, or at least the former
# to be less optimistic than the latter, but often it is the other way round:
trainobj$results
confusionMatrix(trainobj$pred$pred, trainobj$pred$obs)
Thanks,
PS: the output of sessionInfo() is the following:
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel grid stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] glmnet_2.0-16 Matrix_1.2-14 doParallel_1.0.11 iterators_1.0.9
[5] foreach_1.4.4 imbalance_1.0.0 logicFS_1.50.0 mcbiopi_1.1.2
[9] LogicReg_1.5.9 survival_2.41-3 kernlab_0.9-26 mlbench_2.1-1
[13] DMwR_0.4.1 praznik_5.0.0 pryr_0.1.4 caret_6.0-79
[17] ggplot2_2.2.1 lattice_0.20-35 ROCR_1.0-7 gplots_3.0.1
loaded via a namespace (and not attached):
[1] nlme_3.1-137 bitops_1.0-6 xts_0.10-2 lubridate_1.7.4
[5] dimRed_0.1.0 C50_0.1.1 tools_3.5.0 R6_2.2.2
[9] rpart_4.1-13 KernSmooth_2.23-15 lazyeval_0.2.1 colorspace_1.3-2
[13] nnet_7.3-12 withr_2.1.2 gbm_2.1.3 tidyselect_0.2.4
[17] mnormt_1.5-5 curl_3.2 compiler_3.5.0 mda_0.4-10
[21] Cubist_0.2.1 caTools_1.17.1 scales_0.5.0 sfsmisc_1.1-2
[25] DEoptimR_1.0-8 mvtnorm_1.0-7 psych_1.8.4 robustbase_0.93-0
[29] stringr_1.3.1 foreign_0.8-70 pkgconfig_2.0.1 rlang_0.2.0
[33] TTR_0.23-3 ddalpha_1.3.3 quantmod_0.4-13 bindr_0.1.1
[37] zoo_1.8-1 gtools_3.5.0 dplyr_0.7.4 ModelMetrics_1.1.0
[41] magrittr_1.5 Formula_1.2-3 Rcpp_0.12.16 munsell_0.4.3
[45] abind_1.4-5 partykit_1.2-1 stringi_1.2.2 inum_1.0-0
[49] MASS_7.3-49 plyr_1.8.4 recipes_0.1.2 gdata_2.18.0
[53] splines_3.5.0 pillar_1.2.2 reshape2_1.4.3 codetools_0.2-15
[57] stats4_3.5.0 CVST_0.2-1 magic_1.5-8 glue_1.2.0
[61] gtable_0.2.0 purrr_0.2.4 tidyr_0.8.0 assertthat_0.2.0
[65] DRR_0.0.3 gower_0.1.2 prodlim_2018.04.18 libcoin_1.0-1
[69] broom_0.4.4 e1071_1.6-8 class_7.3-14 geometry_0.3-6
[73] timeDate_3043.102 smotefamily_1.2 RcppRoll_0.2.2 tibble_1.4.2
[77] bindrcpp_0.2.2 lava_1.6.1 ipred_0.9-6
The cutoffs used are apparently in the same order after all, so I think the above would have been the correct balanced accuracy after all, and I really don't get why it would be so much higher in the CV results than in the confusionMatrix output.
trainControl will not pass arguments into the summary function (and b isn't used inside of the function). To be honest, you've made this really more complex than it probably needs to be.
Does that difference mean something is wrong, or is it possible for the latter to be much worse than the former, and why is that?
So, to try to answer your question: the difference could really be driven by the dataset size. Since there is an imbalance, the number of events is probably pretty low, so the variability in the sensitivity is probably pretty high (and then the J statistic that you're calculating will have high variance too).
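A quick way to see this variance effect: a hold-out fold from an imbalanced dataset may contain only a handful of events, so any fold-wise sensitivity estimate is noisy. A minimal sketch (the event count of 15 and the true sensitivity of 0.8 are made-up numbers, purely for illustration):

```r
# Sketch: variability of a sensitivity estimate based on ~15 events per fold.
# Each replicate draws 15 Bernoulli(0.8) outcomes and averages them, mimicking
# a fold-wise sensitivity estimate with 15 events and true sensitivity 0.8.
set.seed(123)
sens_hat <- replicate(10000, mean(rbinom(15, 1, 0.8)))
sd(sens_hat)  # roughly 0.10, i.e. large relative to the statistic itself
```

The standard deviation is about sqrt(0.8 * 0.2 / 15) ≈ 0.10, so fold-to-fold sensitivity (and hence balanced accuracy) can easily swing by ten percentage points.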
Am I right in assuming that the former is a CV error, i.e. the average of all hold-out errors, while the latter is a pure training error, since it doesn't hold out the data it predicts, and should therefore yield higher (apparent) performance due to more overfitting?
train reports the average of the statistics computed inside the iterations of resampling, and your estimate using confusionMatrix uses the same predictions but pools the data. I would normally say that the latter approximates the former (but they would not be the same). I've done the same thing with the out-of-sample data, showing an ROC curve to try to "approximate" the curve being estimated by resampling.
I'm trying to understand how it's possible that confusionMatrix is always the more pessimistic one; I would have thought that it would be more optimistic, since it is pooling a lot of data together to make the estimate.
Try mocking up an example for testing. twoClassSim() has an option for the intercept, and that can be used to modulate the event rate. Using a simulated data set would help because you could also simulate a very large test set to approximate the main component of the model's error.
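For instance, a sketch of such a mock-up (the intercept value of -16 and the sample sizes are arbitrary choices to get a low event rate, not values from this thread):

```r
# Sketch: simulate an imbalanced two-class problem with caret::twoClassSim().
# A more negative intercept lowers the event rate; -16 is an arbitrary choice.
library(caret)
set.seed(456)
train_dat <- twoClassSim(300, intercept = -16)
big_test  <- twoClassSim(100000, intercept = -16)  # large set for a stable error estimate

table(train_dat$Class)  # check how imbalanced the simulated classes are
```

A model fit on `train_dat` can then be evaluated both by resampling and on `big_test`, so the two estimates from the original question can each be compared to a near-truth value.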
Hi Max, thanks for your reply. I think there was actually another, much more basic problem with my code that I had neglected, and I think I now understand the error (or at least one of the errors) I made in trying to use custom metrics. (I should have mentioned it here earlier, but I still wasn't 100% sure, and I was also in the middle of an exam session, so I didn't get around to it yet.) Maybe you can confirm whether the following could explain the discrepancy in the opposite direction (i.e. more pessimistic instead of optimistic), or whether it's still not correct:
All of the custom metrics where I noticed this problem were cases where I wanted to use an optimal cutoff value: for example, not the F-beta measure or balanced accuracy at cutoff 0.5, but at the cutoff where it becomes maximal. I think I overlooked the fact that caret will (probably?) be optimizing this cutoff in every repeat and for every hold-out fold separately, causing it to use different cutoffs for different hold-out folds and different repeats. That is of course not going to mimic realistic performance on a test set, where you can only use one single predetermined cutoff.
And then when I use the pooled predictions, I optimize a single cutoff only once, for the whole pooled set, so it of course yields a much lower (but much more realistic) performance. Do you think this explanation of the difference makes sense?
So what I should probably have done was to use a fixed cutoff in that summary function for the custom metric, and then afterwards run an external loop over a few of these cutoffs to select the value where the performance is best, no? I thought caret would do it like this, but in retrospect it was of course doing this within every hold-out fold and every repeat, instead of once for the whole bunch.
So I wonder a bit now whether there would have been an easy way to include that cutoff as an extra tuning parameter in caret?
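For reference, a fixed-cutoff version of the summary function might look like the following. This is only a sketch: it assumes a two-class problem where the first element of lev is the event class and its predicted probabilities sit in data$positive, as in my example above; the name BalAcFixed and the default cutoff of 0.5 are my own inventions.

```r
# Sketch: balanced accuracy at one fixed cutoff instead of a per-fold optimum.
# Assumes lev[1] is the event class, with predicted probabilities in data$positive.
BalAcFixed <- function(data, lev = NULL, model = NULL, cutoff = 0.5) {
  pred_class <- factor(ifelse(data$positive >= cutoff, lev[1], lev[2]),
                       levels = lev)
  sens <- mean(pred_class[data$obs == lev[1]] == lev[1])  # recall on events
  spec <- mean(pred_class[data$obs == lev[2]] == lev[2])  # recall on non-events
  out <- (sens + spec) / 2
  names(out) <- "BalAccFixed"
  out
}
```

An outer loop over candidate cutoff values could then refit (or re-summarize) with each cutoff and keep the one with the best resampled performance.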
(Sorry for the poor original question, by the way; it was the first time I'm using the package (which is really incredibly useful, nice work!), but apparently I still didn't understand well enough how it worked.) Best regards and thanks again for your comments, Tim
About
So what I should probably have done was to use a fixed cutoff in that summary function for the custom metric, and then afterwards run an external loop over a few of these cutoffs to select the value where the performance is best, no?
and
So I wonder a bit now if there would have been an easy way to include that cutoff as an extra tuning parameter in caret?
I wrote an example of some custom code to optimize the threshold based on metrics computed during resampling. That method is a little complex, though, and gets more complex when the model uses the "sub-model trick".
Instead, I wrote a function called thresholder that can do the cutoff analysis after the model has been fit. That would probably work well for what you are trying to do (or, if not, you could adapt it).
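A sketch of how that might look on a fit like the one in this thread (assuming the model was trained with classProbs = TRUE and savePredictions = "all", so the resampled class probabilities are available to thresholder):

```r
# Sketch: evaluate candidate probability cutoffs over the resampled predictions.
# Assumes 'trainobj' is a caret::train() fit with classProbs = TRUE and
# savePredictions = "all".
library(caret)
th <- thresholder(trainobj,
                  threshold = seq(0.05, 0.95, by = 0.05),
                  final = TRUE)
th[which.max(th$J), ]  # e.g. keep the cutoff maximizing Youden's J
```

The chosen cutoff can then be fixed for the final model, just like any other tuning parameter.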
I thought caret would do it like this, but in retrospect it was of course doing this within every hold-out fold and every repeat, instead of once for the whole bunch?
Yes, but you could pick the best threshold using resampling (and understand the variation around that value), then set it for the final model as we would with other tuning parameters.
Hi Max, that example and the thresholder function indeed seem to do exactly what I was referring to here. Thanks a lot for your suggestions, and best regards!
I noticed that the results in my models obtained from caret::train, as in
fitCtrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 5,
                        savePredictions = TRUE, classProbs = TRUE,
                        summaryFunction = BalAc)
model <- train(Y ~ ., data = data1, metric = 'OptBalAcc', method = 'qda',
               trControl = fitCtrl)
model$results
are much more optimistic than what I get from
confusionMatrix(model$pred$pred, model$pred$obs)
where I had expected these to be the same. Or at least I expected the latter (which I thought were the predictions made by the best model when it is applied again to all of the folds together?) to be at least as good as the former (which I thought is the mean of the hold-out performances for each fold/repeat, and thus a little less subject to overfitting, since the data is held out each time the fit takes place?).
So I'm trying to understand how it's possible that confusionMatrix is always the more pessimistic one, when it is just the training error made when the same best model is used to predict all the data? Or is this not the case? Thanks