Closed: edublancas closed this 1 year ago
Let's do XGBoost first; it outperforms logistic regression by far.
@idomic Working on the XGBoost defaults right now. Do we only want a set of good default values, or is there somewhere I can also update the code with these values? I can see in #215 that there is a file for grid values, but I could not find it in the repo. Please clarify. Thanks!
For the latter part check this one out.
I think you should first share your conclusions on the best values for the different classifiers; then we should edit them into the code and add a reference to the article justifying the change. Once we're done with XGBoost, we can do logistic regression, random forest, etc.
Got it! Working on XGBoost tonight, I'll share my conclusions and reference here later.
XGBoost
Conclusions:
Please leave comments so that I know whether I am drawing the conclusions we want. Thanks!
Looks great; seems we were headed in the right direction too. Let's add logistic regression here as well. Once that's here, you can go ahead and open a PR to revise the defaults and ranges accordingly.
Also, upon checking gamma, it seems the default value there is 0, not 1; please double-check that the values actually match. Gamma can be any non-negative value, and its default is 0.
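For reference, a minimal sketch of the XGBoost defaults under discussion, encoded as a plain dict (values follow the XGBoost documentation; the dict layout itself is just illustrative bookkeeping, not the repo's actual config file):

```python
# Illustrative defaults for a few XGBoost hyperparameters.
# Values follow the XGBoost docs; the dict itself is just a sketch.
xgb_defaults = {
    "gamma": 0,            # a.k.a. min_split_loss; default is 0, not 1
    "learning_rate": 0.3,  # default eta in XGBoost
    "max_depth": 6,        # default tree depth
    "n_estimators": 100,   # sklearn-API default number of boosting rounds
}
```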
Got it! Working on it.
Also, if we can find some peer-reviewed papers, that'd be great, because Medium posts might not be rigorous.
ok!
Yes, you are right; I'll fix it to 0.
Update on logistic regression: I'm still working on finding more rigorous documents, but these are by far the most direct sources indicating default parameter values for the model.
Source: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html gives default values for the following parameters:

- tol: 1e-4
- C: 1.0
- intercept_scaling: 1
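The defaults above can be collected into a small reference dict (values copied from the scikit-learn docs linked; the dict is just for bookkeeping, not part of any library API):

```python
# Defaults from the scikit-learn LogisticRegression docs cited above.
logreg_defaults = {
    "tol": 1e-4,
    "C": 1.0,
    "intercept_scaling": 1,
}
```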
Further updates: Source: https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/ For logistic regression, the post uses these grid search ranges:

- solver = ['newton-cg', 'lbfgs', 'liblinear']
- penalty = ['none', 'l1', 'l2', 'elasticnet']
- C = loguniform(1e-5, 100)
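To make the C range concrete: loguniform(1e-5, 100) samples uniformly on a log scale, so small and large values of C are equally likely. A stdlib-only sketch of the idea (scipy.stats.loguniform does the same job in the post's setup):

```python
import math
import random

def loguniform(low, high, rng=random):
    """Sample a value uniformly on a log scale between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

# Draw a few candidate C values in the range above: loguniform(1e-5, 100)
samples = [loguniform(1e-5, 100) for _ in range(5)]
```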
- C: logspace(-4, 4, 20)
- penalty: ['l1', 'l2']
- n_estimators: [10, 101, 10]
- max_features: [6, 32, 5]
These look like they mix a few models? penalty is for logistic regression, but n_estimators and max_features are for random forest.
Good catch. The post suddenly switched to the random forest classifier and I got confused. I'll redo this part tonight.
Any updates on this?
@yafimvo is working on a new feature to help users optimize model hyperparameters (https://github.com/ploomber/sklearn-evaluation/pull/215); we defined some defaults for now, based on a blog post that cites a paper, but we'd like to do more research.
cc @idomic @WSShawn