researching good defaults for grids

edublancas commented 1 year ago

@yafimvo is working on a new feature to help users optimize model hyperparameters (https://github.com/ploomber/sklearn-evaluation/pull/215). We defined some defaults for now, based on a blog post that cites a paper, but we'd like to do more research.

First, we must dive deeper into the blog post and paper I shared above to define sensible defaults for a random forest grid search. Then, we'd like to look for other resources to see if we can find more information.
We can repeat the same process for other models. I think we can start with XGBoost or logistic regression, since those are the most popular models in ML.

cc @idomic @WSShawn

idomic commented 1 year ago

Let's do XGBoost first; it surpasses logistic regression by far.

WSShawn commented 1 year ago

@idomic Working on the XGBoost defaults right now. Do we only want a set of good default values or is there anywhere I can update the code with these values also? I can see in #215 that there is a file for grid values but I could not find it in the repo. Please clarify. Thanks!

idomic commented 1 year ago

For the latter part check this one out.

I think you should first share your conclusions on the best values for the different classifiers; then we should edit them in the code and add a reference to the article justifying the changed values. Once we're done with XGB, we can do logistic regression, random forest, etc.

WSShawn commented 1 year ago

Got it! Working on XGBoost tonight, I'll share my conclusions and reference here later.

WSShawn commented 1 year ago

XGBoost

Reference: https://medium.com/broadhorizon-cmotions/hyperparameter-tuning-for-hyperaccurate-xgboost-model-d6e6b8650a11#:~:text=Typically%20used%20values%20are%200.4,its%20default%20value%20is%201

Conclusions:

learning_rate (eta): typically 0.01 - 0.3, default 0.3; in the cited example, the grid is [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
n_estimators: any integer larger than 0 (the larger the value, the better the fit on the training set), default 100; in the cited example, the grid is [50, 500, 50]
max_depth: 3 - 10 is very typical, default 6; in the cited example, the grid is [3, 10, 2]
min_child_weight: any non-negative number, default 1; in the cited example, the grid is [1, 6, 2]
subsample: typical values are 0.4 - 1, default 1; in the cited example, the grid is [0.4, 1, 0.1]
colsample_bytree: typical values are 0.4 - 1, default 1; in the cited example, the grid is [0.4, 1, 0.1]
gamma (min_split_loss): any non-negative number, default 0; in the cited example, the grid is [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
lambda: any non-negative number, default 1; in the cited example, the grid is [0, 0.5, 1, 1.5, 2, 3, 4.5]
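
To make the size of this search space concrete, here is a minimal sketch that writes the values above as a plain Python dict and counts the combinations. It rests on two assumptions that the thread does not confirm: the three-element grids such as [50, 500, 50] are read as [start, stop, step] with an inclusive stop, and the key `reg_lambda` is used for lambda because that is the name in XGBoost's scikit-learn wrapper. The helper `expand` is hypothetical and exists only for this illustration.

```python
from math import prod


def expand(start, stop, step, ndigits=2):
    """Expand a [start, stop, step] triple into explicit values.

    Assumption: the stop value is inclusive; the thread does not say.
    """
    # The small epsilon guards against float division such as 0.6 / 0.1
    # landing just below the intended whole number.
    count = int((stop - start) / step + 1e-9) + 1
    return [round(start + i * step, ndigits) for i in range(count)]


# Grid values copied from the list above; triples are expanded by `expand`.
XGB_GRID = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
    "n_estimators": expand(50, 500, 50),
    "max_depth": expand(3, 10, 2),
    "min_child_weight": expand(1, 6, 2),
    "subsample": expand(0.4, 1, 0.1),
    "colsample_bytree": expand(0.4, 1, 0.1),
    "gamma": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "reg_lambda": [0, 0.5, 1, 1.5, 2, 3, 4.5],  # "lambda" in the list above
}

# Number of candidates an exhaustive grid search would have to fit.
n_combinations = prod(len(values) for values in XGB_GRID.values())
print(n_combinations)
```

Under those assumptions the full grid has 1,728,720 combinations, which suggests that a default grid would need to be much smaller than the cited one, or sampled randomly instead of searched exhaustively.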

WSShawn commented 1 year ago

Please leave comments so that I know if I am getting the right conclusions that we want. Thanks!

idomic commented 1 year ago

It's great, seems we were also in the right direction. Let's add logistic regression here as well? Once that's here, you can go ahead and open a PR to revise the defaults and ranges accordingly.

Also, upon checking gamma, it seems the default value there is 0 and not 1, so please double-check that the values actually match. Gamma can be any non-negative number and its default value is 0.

WSShawn commented 1 year ago

got it! working on it.

edublancas commented 1 year ago

also, if we can find some peer-reviewed papers, that'd be great! because medium posts might not be rigorous

WSShawn commented 1 year ago

ok!

WSShawn commented 1 year ago

> It's great, seems we were also in the right direction. Let's add logistic regression here as well? Once that's here, you can go ahead and open a PR to revise the defaults and ranges accordingly.
>
> Also, upon checking gamma, it seems the default value there is 0 and not 1, so please double-check that the values actually match. Gamma can be any non-negative number and its default value is 0.

yes you are right, I'll fix it to 0

WSShawn commented 1 year ago

Update on logistic regression: I'm still looking for more rigorous sources, but so far these are the most direct sources for the model's default parameter values.

Source: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html gives default values for the following parameters:
tol: 1e-4
C: 1
intercept_scaling: 1

Further updates:
Source: https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/
For the logistic regression parameters, these search ranges are used:
solver = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['none', 'l1', 'l2', 'elasticnet']
C = loguniform(1e-5, 100)
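
One caveat with these ranges is that not every solver accepts every penalty, so a default grid probably should not cross the two lists blindly. The sketch below filters the pairs using the solver/penalty support described in the scikit-learn `LogisticRegression` documentation and draws C log-uniformly with the standard library. The names `SUPPORTED`, `valid_pairs`, and `sample_c` are hypothetical and not part of sklearn-evaluation. Newer scikit-learn releases also appear to spell the unpenalized option as `None` rather than the string 'none', so check the version you target.

```python
import math
import random
from itertools import product

SOLVERS = ["newton-cg", "lbfgs", "liblinear"]
PENALTIES = ["none", "l1", "l2", "elasticnet"]

# Penalties each listed solver accepts, per the scikit-learn docs.
# 'elasticnet' is only supported by the 'saga' solver, which is not listed.
SUPPORTED = {
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "liblinear": {"l1", "l2"},
}


def valid_pairs(solvers, penalties):
    """Keep only the (solver, penalty) pairs the solver can actually fit."""
    return [(s, p) for s, p in product(solvers, penalties) if p in SUPPORTED[s]]


def sample_c(rng, low=1e-5, high=100.0):
    """Draw C log-uniformly from [low, high], mirroring loguniform(1e-5, 100)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


pairs = valid_pairs(SOLVERS, PENALTIES)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
candidates = [{"solver": s, "penalty": p, "C": sample_c(rng)} for s, p in pairs]

for candidate in candidates:
    print(candidate)
```

Of the 12 possible pairs only 6 survive, and 'elasticnet' never appears because none of the three listed solvers supports it.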

edublancas commented 1 year ago

C: logspace(-4, 4, 20)
penalty: ['l1', 'l2']
n_estimators: [10, 101, 10]
max_features: [6, 32, 5]

These look like they mix a few models? penalty is for logistic regression, but n_estimators and max_features are for random forest?
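
A rough sketch of how such a mix-up could be caught automatically: group each parameter under the model that owns it. `KNOWN_PARAMS` is a hand-written, deliberately partial list of real scikit-learn parameter names, and `split_by_model` and `logspace` are hypothetical helpers for this illustration. The triples [10, 101, 10] and [6, 32, 5] are assumed to mean `range(start, stop, step)`, which the thread does not confirm.

```python
# Partial, hand-written parameter lists; extend as more models are researched.
KNOWN_PARAMS = {
    "LogisticRegression": {"C", "penalty", "solver", "tol", "intercept_scaling"},
    "RandomForestClassifier": {"n_estimators", "max_features", "max_depth"},
}


def logspace(start_exp, stop_exp, num):
    """Return `num` base-10 log-spaced values from 10**start_exp to 10**stop_exp."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]


# The mixed values quoted above.
MIXED = {
    "C": logspace(-4, 4, 20),
    "penalty": ["l1", "l2"],
    "n_estimators": list(range(10, 101, 10)),  # assumed range semantics
    "max_features": list(range(6, 32, 5)),     # assumed range semantics
}


def split_by_model(grid, known_params):
    """Group grid entries by owning model; collect unrecognized names separately."""
    per_model = {model: {} for model in known_params}
    unknown = {}
    for name, values in grid.items():
        owners = [m for m, params in known_params.items() if name in params]
        if not owners:
            unknown[name] = values
        for model in owners:
            per_model[model][name] = values
    return per_model, unknown


per_model, unknown = split_by_model(MIXED, KNOWN_PARAMS)
for model, grid in per_model.items():
    print(model, sorted(grid))
```

With the values above, C and penalty land under logistic regression while n_estimators and max_features land under random forest, which matches the split described in this comment.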

WSShawn commented 1 year ago

good catch. This post suddenly started to talk about random-forest classifier and I got confused. I'll redo this part tonight.

edublancas commented 1 year ago

any updates on this?

ploomber / sklearn-evaluation

researching good defaults for grids #229