topepo / caret

caret (Classification And Regression Training) is an R package that contains miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

train.default's predictions are not consistent with the returned model's predictions #943

Closed: glenrs closed this issue 6 years ago

glenrs commented 6 years ago

train.default loops many times to find which parameter combinations are the most effective, but then discards all of the fitted models. I am assuming the models are discarded to save memory. This is an issue for models that have additional sources of randomness beyond their hyperparameters, such as random forests. Even with the same initialization we can see a drastic change in performance.
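For illustration (this sketch is not from the original report; it uses ranger directly on the built-in iris data rather than the reprex data): two forests fit with identical hyperparameters but different RNG states generally do not produce identical out-of-bag predictions.

library(ranger)

set.seed(1)
fit1 <- ranger(Species ~ ., data = iris, mtry = 2, min.node.size = 1)
set.seed(2)
fit2 <- ranger(Species ~ ., data = iris, mtry = 2, min.node.size = 1)

# Same hyperparameters, different random state: the two forests (and
# their out-of-bag predictions) typically differ.
identical(fit1$predictions, fit2$predictions)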

Below, the initial model was 53% correct, but another model with the same parameter combination was 100% correct. In this case we see a dramatic increase in performance, but models could also potentially perform much worse.

Minimal dataset:

The minimal dataset I am using is pima_diabetes, which is included in the healthcareai package. You can install that package from CRAN.

install.packages("healthcareai")

Minimal, runnable code:

library(healthcareai)
#> healthcareai version 2.2.0
#> Please visit https://docs.healthcare.ai for full documentation and vignettes. Join the community at https://healthcare-ai.slack.com
library(tidyverse)

d <- prep_data(pima_diabetes, outcome = diabetes)
#> Training new data prep recipe...
d_features <- select(d, -diabetes)
d_outcomes <- pull(d, diabetes)

train_control <- caret::trainControl(method = "cv",
                                     number = 5,
                                     search = "grid",
                                     savePredictions = "final")
train_control$summaryFunction <- caret::twoClassSummary
train_control$classProbs <- TRUE
tune_grid <- data.frame(
  mtry = 3,
  splitrule = "extratrees",
  min.node.size = 1
)

train_list <- caret::train(x = d_features, y = d_outcomes, method = "ranger", 
                           metric = "ROC", trControl = train_control, tuneGrid = tune_grid,
                           importance = "impurity")
#> Loading required package: lattice
#> 
#> Attaching package: 'caret'
#> The following object is masked from 'package:purrr':
#> 
#>     lift
#> Warning: Setting row names on a tibble is deprecated.

trained_predictions <- train_list$pred$pred
mean(trained_predictions == d_outcomes)
#> [1] 0.5390625

predict_output <- caret::predict.train(train_list, d_features, type = "prob")
predict_predictions <- predict_output$Y
outcome <- ifelse(predict_predictions > .45, "Y", "N")
mean(d_outcomes == outcome)
#> [1] 1

Created on 2018-09-27 by the reprex package (v0.2.0).

Session Info:


sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.5.1  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
#>  [5] tools_3.5.1     htmltools_0.3.6 yaml_2.2.0      Rcpp_0.12.18   
#>  [9] stringi_1.2.4   rmarkdown_1.10  knitr_1.20      stringr_1.3.1  
#> [13] digest_0.6.16   evaluate_0.11


topepo commented 6 years ago

I am assuming the models are discarded to save memory.

Yes

Below, the initial model was 53% correct, but another model with the same parameter combination was 100% correct. In this case we see a dramatic increase in performance, but models could also potentially perform much worse.

You can't pick and choose which resampled model to use; you are using resampling to estimate performance of the random forest model and that uses all of the resamples.
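As a rough illustration with the train object from the reprex above (slot names per caret's conventions): the resampling estimate that train() reports is an aggregate over every held-out fold, not the fit from any single fold.

train_list$resample   # one row of ROC / Sens / Spec per CV fold
train_list$results    # their mean and SD across the five folds; this is what train() tunes on and reports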

The range in performance that you see is driven by a lot of different things. It's not that one resampled model fit is better than the other; they are random realizations of that model on different data sets (and not an increase in performance). There is often a resample-to-resample effect, meaning that some resamples show good performance across many models (or submodels). This is most likely what you are seeing.
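A hedged sketch of looking at that fold-to-fold spread directly, assuming dplyr and the held-out predictions saved by savePredictions = "final" in the reprex above:

library(dplyr)

# Per-fold held-out accuracy; the spread across the Resample labels is
# the resample-to-resample effect described above.
train_list$pred %>%
  group_by(Resample) %>%
  summarise(accuracy = mean(pred == obs))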