ploomber / sklearn-evaluation

Machine learning model evaluation made easy: plots, tables, HTML reports, experiment tracking and Jupyter notebook analysis.
https://sklearn-evaluation.ploomber.io
Apache License 2.0

researching good defaults for grids #229

Closed edublancas closed 1 year ago

edublancas commented 1 year ago

@yafimvo is working on a new feature to help users optimize model hyperparameters (https://github.com/ploomber/sklearn-evaluation/pull/215). We defined some defaults for now, based on a blog post that cites a paper, but we'd like to do more research.

cc @idomic @WSShawn

idomic commented 1 year ago

Let's do XGBoost first; it surpasses logistic regression by far.

WSShawn commented 1 year ago

@idomic Working on the XGBoost defaults right now. Do we only want a set of good default values or is there anywhere I can update the code with these values also? I can see in #215 that there is a file for grid values but I could not find it in the repo. Please clarify. Thanks!

idomic commented 1 year ago

For the latter part check this one out.

I think you should first share your conclusions on the best values for the different classifiers; then we should update the code and add a reference to the article justifying the new values. Once we're done with XGB, we can do logistic regression, random forest, etc.

WSShawn commented 1 year ago

Got it! Working on XGBoost tonight, I'll share my conclusions and reference here later.

WSShawn commented 1 year ago

XGBoost

Reference: https://medium.com/broadhorizon-cmotions/hyperparameter-tuning-for-hyperaccurate-xgboost-model-d6e6b8650a11#:~:text=Typically%20used%20values%20are%200.4,its%20default%20value%20is%201

Conclusions:

WSShawn commented 1 year ago

Please leave comments so that I know if I am getting the right conclusions that we want. Thanks!

idomic commented 1 year ago

It's great, seems we were also in the right direction. Let's add logistic regression here as well? Once that's here, you can go ahead and open a PR to revise the defaults and ranges accordingly.

Also, upon checking gamma, it seems the default value there is 0 and not 1; please double-check that the values actually match. Gamma can be any non-negative number and its default value is 0.

WSShawn commented 1 year ago

got it! working on it.

edublancas commented 1 year ago

also, if we can find some peer-reviewed papers, that'd be great! because medium posts might not be rigorous

WSShawn commented 1 year ago

ok!

WSShawn commented 1 year ago

> It's great, seems we were also in the right direction. Let's add logistic regression here as well? Once that's here, you can go ahead and open a PR to revise the defaults and ranges accordingly.
>
> Also, upon checking gamma, it seems the default value there is 0 and not 1; please double-check that the values actually match. Gamma can be any non-negative number and its default value is 0.

yes you are right, I'll fix it to 0

WSShawn commented 1 year ago

Update on logistic regression: I'm still looking for more rigorous documents, but these are the most direct sources I've found so far for the model's default parameter values.

Source: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html gives default values for the following parameters: tol: 1e-4, C: 1.0, intercept_scaling: 1
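Those defaults can also be read straight from the estimator instead of the docs page. A minimal sketch, assuming scikit-learn is installed; the exact values printed depend on the installed version, so treat it as a check and not as a reference:

```python
from sklearn.linear_model import LogisticRegression

# get_params() on an unfitted estimator returns its constructor defaults
defaults = LogisticRegression().get_params()

# Print only the parameters discussed in this thread
for name in ("tol", "C", "intercept_scaling"):
    print(f"{name}: {defaults[name]}")
```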

Further updates:

Source: https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/

For logistic regression, the post uses these search ranges:

solver = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['none', 'l1', 'l2', 'elasticnet']
C = loguniform(1e-5, 100)
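For anyone who wants to try that search space, here is a rough sketch using `RandomizedSearchCV`. It is not the code from the post: the synthetic dataset, `n_iter`, `cv` and `max_iter` are assumptions made for illustration. The space is also split into two dictionaries so that only solver/penalty pairs accepted by scikit-learn are sampled: `'elasticnet'` is left out because only the `saga` solver supports it, and `'none'` is left out because `liblinear` does not support it and recent scikit-learn versions spell it `None`.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data, only for illustration (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Same C range as in the post
c_dist = loguniform(1e-5, 100)

# One dictionary per group of compatible solver/penalty combinations
space = [
    {"solver": ["newton-cg", "lbfgs"], "penalty": ["l2"], "C": c_dist},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": c_dist},
]

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    space,
    n_iter=10,  # small budget to keep the sketch fast (assumption)
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```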

edublancas commented 1 year ago

C: logspace(-4, 4, 20)
penalty: ['l1', 'l2']
n_estimators: [10, 101, 10]
max_features: [6, 32, 5]

these look like they mix a few models? penalty is for logistic regression, but n_estimators and max_features are for random forest?
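One way to keep the two apart is to key the grids by model name. This is only a sketch: the dictionary layout and the `grid_size` helper are hypothetical, not the structure used in the sklearn-evaluation code, and reading the bracketed triples as `range(start, stop, step)` arguments is an assumption about what the quoted post meant.

```python
import math

import numpy as np

# Assumption: [10, 101, 10] and [6, 32, 5] are (start, stop, step) for range()
grids = {
    "LogisticRegression": {
        "C": np.logspace(-4, 4, 20),
        "penalty": ["l1", "l2"],
    },
    "RandomForestClassifier": {
        "n_estimators": list(range(10, 101, 10)),
        "max_features": list(range(6, 32, 5)),
    },
}


def grid_size(grid):
    """Number of combinations in an exhaustive search over this grid."""
    return math.prod(len(values) for values in grid.values())


for model, grid in grids.items():
    print(model, grid_size(grid))
```

Keeping each model's parameters in its own dictionary makes it obvious when a parameter such as n_estimators ends up under the wrong estimator.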

WSShawn commented 1 year ago

good catch. This post suddenly started to talk about random-forest classifier and I got confused. I'll redo this part tonight.

edublancas commented 1 year ago

any updates on this?