Closed glenrs closed 6 years ago
I am assuming the models are discarded to save memory.
Yes
Below, the initial model was 53% correct, but another model with the same parameter combination was 100% correct. In this case we see a dramatic increase in performance, but models could also potentially perform much worse.
You can't pick and choose which resampled model to use; you are using resampling to estimate performance of the random forest model and that uses all of the resamples.
The range in performance that you see is driven by a lot of different things. It's not that one resampled model fit is better than the other, they are random realizations of that model on different data sets (and not an increase in performance). There is often a resample-to-resample effect, meaning that some resamples have good performance across many models (or submodels). This is most likely what you are seeing.
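The resample-to-resample effect described above can be sketched as follows. This is a hypothetical illustration (using the built-in iris data and the randomForest package rather than the original pima_diabetes example): the same model specification fit on different bootstrap resamples yields a spread of accuracies, and none of those fits is "the" model — they are random realizations used to estimate performance.

```r
library(randomForest)

data(iris)
set.seed(1)

## Fit the same specification on five bootstrap resamples and score
## each fit on its held-out (out-of-bag) rows.
accs <- sapply(1:5, function(i) {
  idx  <- sample(nrow(iris), nrow(iris), replace = TRUE)   # bootstrap resample
  hold <- setdiff(seq_len(nrow(iris)), unique(idx))        # held-out rows
  fit  <- randomForest(Species ~ ., data = iris[idx, ], mtry = 2)
  mean(predict(fit, iris[hold, ]) == iris$Species[hold])
})

## The spread across resamples, not evidence that one fit is "better".
summary(accs)
```

The variation in `accs` is exactly what resampling is meant to capture: caret aggregates it into a performance estimate rather than picking the luckiest fit.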
train.default loops many times to find out which parameter combinations are the most effective, but then discards all the created models. I am assuming the models are discarded to save memory. This is an issue when a model has a randomness component beyond its hyperparameters, such as a random forest. Even with the same initialization we can see a drastic change in performance. Below, the initial model was 53% correct, but another model with the same parameter combination was 100% correct. In this case we see a dramatic increase in performance, but models could also potentially perform much worse.
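The randomness-beyond-hyperparameters point can be demonstrated in a small sketch (again a hypothetical example on the built-in iris data, not the original pima_diabetes reprex): two random forests fit with identical hyperparameters but different random seeds draw different bootstrap samples and candidate splits, so their error estimates need not agree.

```r
library(randomForest)

data(iris)

## Identical hyperparameters (mtry, ntree), different seeds.
set.seed(10)
fit1 <- randomForest(Species ~ ., data = iris, mtry = 2, ntree = 50)
set.seed(20)
fit2 <- randomForest(Species ~ ., data = iris, mtry = 2, ntree = 50)

## The out-of-bag error after all 50 trees can differ between the two
## fits even though the parameter combination is the same.
fit1$err.rate[50, "OOB"]
fit2$err.rate[50, "OOB"]
```

This is why comparing one resampled fit against another with the same parameters is not an "increase in performance" — the difference is sampling noise, not an improvement.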
Minimal dataset:
The minimal dataset that I am using is the pima_diabetes dataset. It is included in the healthcareai package, which you can download from CRAN.

Minimal, runnable code:
Created on 2018-09-27 by the reprex package (v0.2.0).
Session Info:
Created on 2018-09-27 by the reprex package (v0.2.0).