topepo / caret

caret (Classification And Regression Training): an R package containing miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Inconsistency in displayed statistics between summary(resamples(models)) and ggplot(resamples(models)) #952

Open adamryczkowski opened 6 years ago

adamryczkowski commented 6 years ago
summary(resamples(models))
# ...
# RMSE 
#                     Min.  1st Qu.   Median     Mean   3rd Qu.      Max.  NA's
# icr              6.622260 7.842431 8.377997 8.494023  9.128963 10.233956     0
# cforest          6.588902 7.814897 8.448722 8.502404  9.019769 10.275960     0
# plsRglm          7.062627 7.916284 8.550767 8.514421  8.988415 10.991518     0
# enet             7.279606 8.072162 8.484581 8.535624  8.888511 10.451202     0
# ranger           6.638987 8.180931 8.563769 8.555081  8.940539 10.848453     0
# BstLm            6.782110 8.091716 8.446203 8.567882  9.015425 10.657177     0
# bridge           7.078735 8.053071 8.579939 8.563488  9.024235 10.879757     0
# glmnet           6.722979 8.075154 8.458597 8.573572  9.002222 10.682687     0
# ...

These are standard positional statistics, whereas

ggplot(resamples(models),metric="RMSE")

returns a chart that shows, for each model, a confidence interval computed over all of its resamples.

I could fix that (by adding an option to caret:::ggplot.resamples), but I see a two-month backlog of unreviewed pull requests in your repository, and I wonder what the time frame would be for my patch to be reviewed.

If you do not wish to maintain this package anymore, I would really appreciate it if you could point me to the most promising (from your perspective) fork, so that I would have a place to put patches.

topepo commented 6 years ago

Please read the posting guide about providing a reprex along with session information etc.

If you want to get the confidence intervals back, a better approach would be to make a broom::tidy method for diff.resamples. Even better would be to use tidyposterior to get credible intervals.

> If you do not wish to maintain this package anymore, I would really appreciate it if you could point me to the most promising (from your perspective) fork, so that I would have a place to put patches.

It is being maintained, just not at the rate that you would like. If you want to put a PR in, I'll get to it when current deadlines permit (probably in late Nov/early Dec). I would suggest branching off master; the main development branch is independent of these changes.

adamryczkowski commented 6 years ago

Thank you for answering.

Actually, I do not want the confidence intervals, because they depend on the number of resamples I take (as this number goes to infinity, the interval widths shrink asymptotically to zero, so they say little about the spread of the metric).
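The dependence on the resample count can be checked with a small base-R simulation. This is only a sketch: the RMSE values are simulated from a normal distribution with made-up parameters, not taken from any caret model, and the interval is an ordinary t-based interval for the mean (an assumption about what the plot draws).

```r
# Simulated "RMSE per resample" values; mean and sd are arbitrary assumptions.
set.seed(1)

interval_widths <- function(n) {
  rmse <- rnorm(n, mean = 8.5, sd = 0.9)
  ci <- t.test(rmse, conf.level = 0.95)$conf.int   # t-based CI for the mean
  qr <- quantile(rmse, probs = c(0.05, 0.95))      # 5% to 95% quantile range
  c(n = n, ci_width = unname(diff(ci)), quantile_width = unname(diff(qr)))
}

# One row per resample count: the CI width shrinks roughly like 1/sqrt(n),
# while the quantile range settles around a fixed width.
widths <- t(sapply(c(10, 100, 1000, 10000), interval_widths))
print(round(widths, 3))
```

With more resamples the confidence interval describes the uncertainty of the mean ever more tightly, whereas the quantile range keeps describing the spread of the individual resample results.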

I want to have an option to draw the interquantile range (for example, the 5% to 95% quantile range) instead.

I have already implemented an error_statistic option for the ggplot.resamples function in my fork of your code. It serves its purpose, and I am done with the problem. I do not want to add to your burden, so tell me if you are interested in reviewing this patch; if so, I will update the documentation and open a PR.
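For readers who have not seen the fork, the idea of such an option can be sketched in base R. Everything here is hypothetical: the helper name error_bar_limits, the accepted values of error_statistic, and the sample numbers are illustrative assumptions, not the code from the fork or from caret.

```r
# Hypothetical helper, not caret code: summarise one model's resampled metric
# as a centre plus lower/upper error-bar limits.
error_bar_limits <- function(values,
                             error_statistic = c("ci", "iqr", "quantile"),
                             conf.level = 0.95,
                             probs = c(0.05, 0.95)) {
  error_statistic <- match.arg(error_statistic)
  values <- values[!is.na(values)]
  switch(error_statistic,
    ci = {
      # mean with a t-based confidence interval for the mean
      ci <- t.test(values, conf.level = conf.level)$conf.int
      c(centre = mean(values), lower = ci[1], upper = ci[2])
    },
    iqr = {
      # median with the interquartile range
      q <- unname(quantile(values, probs = c(0.25, 0.5, 0.75)))
      c(centre = q[2], lower = q[1], upper = q[3])
    },
    quantile = {
      # median with a user-chosen quantile range (5% to 95% by default)
      q <- unname(quantile(values, probs = probs))
      c(centre = median(values), lower = q[1], upper = q[2])
    }
  )
}

rmse <- c(6.6, 7.8, 8.4, 9.1, 10.2)  # made-up resample values
error_bar_limits(rmse, "ci")
error_bar_limits(rmse, "iqr")
error_bar_limits(rmse, "quantile")
```

A plotting function could call a helper like this once per model and pass the three values to its error-bar layer, leaving the existing confidence-interval behaviour as the default.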

topepo commented 6 years ago

I'm going to start working on a new release tomorrow. If you want to put a PR in, it would be a good time to do it.