tuning parameters of rpart

paulvanderlaken / ppsr

R implementation of Predictive Power Score

GNU General Public License v3.0

tuning parameters of rpart #34

Open SamGG opened 3 years ago

SamGG commented 3 years ago

Thanks a lot for implementing PPS in R.

The decision tree in sklearn is not configured with the same options as the one in R. Because sklearn's default minimum node sizes are so small, I think sklearn's trees are much finer than R's, leading to better scores in Python than in R. For example, I can't reproduce the Survived vs Sex example based on the Titanic dataset. Is there a way to tune the default parameters of rpart?

And more generally, since you nicely implemented PPS on top of the huge library of models offered by parsnip, how do you think I could try other algorithms?

Best.

paulvanderlaken commented 3 years ago

Hi Samuel, I still need to look into passing (hyper)parameters to the models. I think most users wouldn't be interested in tweaking the models. And I don't want to encourage people to use ppsr for anything but exploratory analyses. Still, I'll look into this. If you have any suggestions for code or solutions, please do suggest!

Regarding the other algorithms, I had implemented bagging/boosting methods, but of course these don't add value for univariate relationships. Lately, I've been thinking about introducing SVM... What would you like to see included? It's actually very simple to add new models (as long as we ignore the parameters), would you care to try?

SamGG commented 3 years ago

It's not clear to me how to achieve this. I agree that PPS should be kept simple.

My initial point concerned the default parameters of rpart: could they play a role in the score? IMHO the default parameters in scikit-learn are too prone to overfitting: the equivalent of minsplit = 2, minbucket = 1, IIRC. Because I don't like parameters being defined in another package (parsnip, and I am only guessing that parsnip's defaults are rpart's defaults), I would prefer to have the parameters clearly set in ppsr. Maybe there could be graded presets such as "robust" and "optimistic" :-)
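
To make the question concrete, here is a toy, pure-Python sketch of how the two stopping rules change the size of a tree. It is neither rpart nor scikit-learn: `gini`, `count_leaves`, and the data are invented for illustration, and the only thing borrowed from rpart is the meaning of `minsplit` (minimum node size for a split to be attempted) and `minbucket` (minimum size of any leaf).

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def count_leaves(points, minsplit, minbucket):
    """Count leaves of a greedy tree grown on (x, y) pairs sorted by x.

    Assumes minbucket >= 1 and y in {0, 1}.
    """
    n = len(points)
    labels = [y for _, y in points]
    parent = gini(labels)
    # Stop when the node is too small to attempt a split, or already pure.
    if n < minsplit or parent == 0.0:
        return 1
    best_score, best_i = None, None
    # Every candidate split must leave at least `minbucket` points per side.
    for i in range(minbucket, n - minbucket + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        left, right = labels[:i], labels[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best_score is None or score < best_score:
            best_score, best_i = score, i
    # Stop when no admissible split reduces impurity.
    if best_score is None or best_score >= parent:
        return 1
    return (count_leaves(points[:best_i], minsplit, minbucket)
            + count_leaves(points[best_i:], minsplit, minbucket))


# Invented data: y switches from 0 to 1 at x = 15, plus two noisy points.
points = [(x, int(x >= 15)) for x in range(30)]
points[4] = (4, 1)
points[25] = (25, 0)

robust_leaves = count_leaves(points, minsplit=20, minbucket=7)
loose_leaves = count_leaves(points, minsplit=2, minbucket=1)
print(robust_leaves, loose_leaves)
```

With the stricter thresholds the toy tree stops after the main split, while the looser ones keep splitting until both noisy points sit in their own leaves. That finer partition fits the training data better, which is the mechanism that could inflate a score.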

Anyway, you did a very nice and interesting job.

paulvanderlaken commented 3 years ago

Great suggestion! These parameters definitely influence the scores that are produced. Let me look into this. I like the idea of having robust/optimistic settings.
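
One way the robust/optimistic idea could look is sketched below in language-neutral form (written in Python only for brevity). Nothing here is part of ppsr: `PRESETS` and `tree_control` are hypothetical names. The "robust" values are the documented defaults of `rpart.control()`; the "optimistic" values take `minsplit` and `minbucket` from the discussion above, and `cp = 0` is an added assumption.

```python
# Hypothetical presets keyed by the argument names of rpart.control().
PRESETS = {
    # rpart.control() defaults: minsplit = 20, minbucket = round(minsplit / 3),
    # cp = 0.01, maxdepth = 30.
    "robust": {"minsplit": 20, "minbucket": 7, "cp": 0.01, "maxdepth": 30},
    # minsplit / minbucket as quoted in the thread; cp = 0 is an assumption.
    "optimistic": {"minsplit": 2, "minbucket": 1, "cp": 0.0, "maxdepth": 30},
}


def tree_control(preset="robust", **overrides):
    """Return the settings for a preset, with optional per-call overrides."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    unknown = set(overrides) - set(PRESETS[preset])
    if unknown:
        raise TypeError(f"unknown tree parameter(s): {sorted(unknown)}")
    # Copy so callers cannot mutate the shared preset table.
    return {**PRESETS[preset], **overrides}


settings = tree_control("optimistic", maxdepth=5)
print(settings)
```

Keeping the values in one table inside the package would make the settings explicit rather than inherited from another package's defaults, while still hiding the individual parameters from users who only want a preset.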