h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Research if categorical_encoding should be optimized in AutoML #8143

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Research whether categorical_encoding should be a parameter that AutoML optimizes. I have found that setting it to a value other than AUTO sometimes improves results: https://github.com/h2oai/h2o-tutorials/blob/master/best-practices/categorical-predictors/gbm_drf.ipynb

exalate-issue-sync[bot] commented 1 year ago

Megan Kurka commented: [~accountid:5b153fb1b0d76456f36daced] just to clarify, this Jira is only to research whether it is worth optimizing the categorical_encoding parameter in AutoML, not to actually implement it.

To me, research would mean checking how much optimizing this parameter improves performance for a variety of datasets and seeing if that performance improvement outweighs the added time it may take to run.
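A minimal sketch of how that comparison could be summarized, assuming each run produces one score per dataset where higher is better (for example AUC). The function name `summarize`, the dataset names, and the scores are hypothetical placeholders, not benchmark results.

```python
from statistics import mean

def summarize(baseline, forked):
    """Compare per-dataset scores (higher is better) from two AutoML runs."""
    # Score change per dataset: positive means the forked run did better.
    deltas = {name: forked[name] - baseline[name] for name in baseline}
    return {
        "wins": sum(d > 0 for d in deltas.values()),
        "losses": sum(d < 0 for d in deltas.values()),
        "mean_delta": mean(deltas.values()),
    }

# Placeholder scores for illustration only; these are not measured values.
baseline_auc = {"dataset_a": 0.80, "dataset_b": 0.90}
forked_auc = {"dataset_a": 0.82, "dataset_b": 0.89}

summary = summarize(baseline_auc, forked_auc)
print(summary)
```

The same summary could be computed once per time budget, so that any gain can be weighed against the extra runtime it required.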

Please let me know if anything about this Jira is unclear.

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: [~accountid:557058:f0137791-c6cb-47bd-bcce-fc81ad4cfefa] Here’s what we can do:

Make a fork of the code with categorical_encoding (all possible values) added to the existing grid searches that support it (GBM, DNN, XGBoost).

Execute the current H2O AutoML (3.30.0.*) on the OpenML AutoML benchmark classification datasets (20 of them have categorical columns). Run for 1 hour.

Execute the forked H2O AutoML on the same OpenML AutoML benchmark classification datasets (20 of them have categorical columns), first for 1 hour and then for 6 hours.

1 hour: We first want to measure the degradation in a fixed-time comparison against the baseline.

6 hours: We then also want to approximate the “unlimited time” use case. Since there are six values for categorical_encoding, we will extend the time by 6x.
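The plan above can be sketched in plain Python: add categorical_encoding to an existing hyperparameter grid, see how much the search space grows, and derive the two runtime budgets. This is only an illustration under stated assumptions. The six encoding names listed, the base grid, and the helper names `add_categorical_encoding`, `grid_size`, and `runtime_budgets_secs` are hypothetical; the values each algorithm actually accepts should be checked against the H2O documentation for the version in use.

```python
# Assumed candidate values for illustration; the exact set supported
# depends on the algorithm and the H2O version.
CATEGORICAL_ENCODINGS = [
    "auto", "enum", "one_hot_explicit", "binary", "eigen", "label_encoder",
]

def add_categorical_encoding(hyper_params, encodings=CATEGORICAL_ENCODINGS):
    """Return a copy of a grid-search dict with categorical_encoding added."""
    expanded = dict(hyper_params)
    expanded["categorical_encoding"] = list(encodings)
    return expanded

def grid_size(hyper_params):
    """Number of combinations in a full Cartesian grid."""
    size = 1
    for values in hyper_params.values():
        size *= len(values)
    return size

def runtime_budgets_secs(baseline_hours, n_encodings):
    """Fixed-time budget and the budget scaled by the number of encodings."""
    fixed = baseline_hours * 3600
    return {"fixed_time": fixed, "scaled_time": fixed * n_encodings}

# Hypothetical GBM grid standing in for the grid that AutoML already searches.
base_grid = {"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1]}
forked_grid = add_categorical_encoding(base_grid)
budgets = runtime_budgets_secs(1, len(CATEGORICAL_ENCODINGS))

print(grid_size(base_grid), grid_size(forked_grid))
print(budgets)
```

The grid grows by a factor equal to the number of encodings, which is the reasoning behind the 6x runtime. In an actual experiment, a dict like `forked_grid` would be passed as `hyper_params` to a grid search, and each budget as `max_runtime_secs` to an AutoML run.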

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7495
Assignee: Sebastien Poirier
Reporter: Megan Kurka
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A