h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

AutoML: Set balance_classes = TRUE for datasets with a >10:1 imbalance #11623

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

If there is more than a 10:1 imbalance in the response column, let's turn on balance_classes = TRUE for all the models in AutoML. We should also consider exposing the balance_classes arg (set to "AUTO" by default) and the other related arguments, class_sampling_factors = NULL, max_after_balance_size = 5.
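
The proposed rule could be sketched roughly as follows. This is a minimal illustration in plain Python, not AutoML's actual implementation: the helper names `imbalance_ratio` and `resolve_balance_params` are hypothetical, and treating "more than 10:1" as majority count divided by minority count is an assumption about how the ratio would be measured. Only the argument names and their defaults (`balance_classes = "AUTO"`, `class_sampling_factors = NULL`, `max_after_balance_size = 5`) come from the proposal above.

```python
from collections import Counter

# The ">10:1" threshold proposed in this issue.
IMBALANCE_THRESHOLD = 10.0


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    return max(counts.values()) / min(counts.values())


def resolve_balance_params(labels, balance_classes="AUTO",
                           class_sampling_factors=None,
                           max_after_balance_size=5.0):
    """Hypothetical helper: resolve "AUTO" into a concrete True/False.

    An explicit True or False from the user is passed through unchanged.
    """
    if balance_classes == "AUTO":
        # Strictly "more than" 10:1, so an exact 10:1 split stays off.
        balance_classes = imbalance_ratio(labels) > IMBALANCE_THRESHOLD
    return {
        "balance_classes": bool(balance_classes),
        "class_sampling_factors": class_sampling_factors,
        "max_after_balance_size": max_after_balance_size,
    }


# Example: 110 negatives vs. 10 positives is an 11:1 imbalance.
params = resolve_balance_params(["no"] * 110 + ["yes"] * 10)
```

Under these assumptions, the resulting dictionary could then be forwarded to each algorithm that accepts these arguments, while a user who sets `balance_classes` explicitly would override the automatic rule.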

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: FYI: https://github.com/h2oai/h2o-tutorials/blob/master/best-practices/imbalanced-data/imbalanced_data_handling.ipynb

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Hey, this is something we could try out and benchmark on some imbalanced datasets. However, if you want to explore more sophisticated ideas for handling class imbalance, we could expand the scope of the ticket. I thought this simple rule would be a low-tech “solution” to start with, since we currently don’t do anything to address class imbalance in AutoML.

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4744
Assignee: Tomas Fryda
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A