h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Ability to keep the best k models in AutoML #12018

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

If we added the ability to keep only the "best" group of k models, we could recycle memory in the H2O cluster, run the AutoML training process for much longer, and train more models. As it stands, the accumulated models eventually use all the memory of the H2O cluster and force us to stop training.

Since we train ensembles in AutoML most of the time (and by default), the "best k" models to keep may not be the "top k" models ranked by model performance (though that would be the naive way to do it). Instead, they would be a collection that contains some of the best-performing models but is also diverse, so that the resulting ensemble is the best possible ensemble of k models. We need to establish a mechanism for choosing the best group of k models for the ensemble without enumerating all the possible ensembles of k models.

The approach will be based on minimizing the correlation between the errors of the base learners.
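One way to picture this is a greedy selection over holdout residuals. The sketch below is only an illustration under stated assumptions, not the planned AutoML implementation: the function names `pearson` and `select_diverse_models`, the choice of RMSE to seed the set, and the use of mean signed Pearson correlation as the diversity score are all hypothetical. It assumes each model's residuals (prediction minus actual) are available on the same holdout rows.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_diverse_models(residuals, k):
    """Greedily pick up to k model ids whose residuals are weakly correlated.

    residuals: dict mapping a model id (hypothetical naming) to its list of
               holdout residuals, all computed on the same rows.
    """
    def rmse(errs):
        return sqrt(sum(e * e for e in errs) / len(errs))

    remaining = set(residuals)
    # Seed with the single most accurate model (lowest RMSE, ties broken by id).
    first = min(remaining, key=lambda m: (rmse(residuals[m]), m))
    selected = [first]
    remaining.remove(first)

    while remaining and len(selected) < k:
        def mean_corr(m):
            # Average signed correlation with the models already kept.
            return sum(pearson(residuals[m], residuals[s]) for s in selected) / len(selected)

        # Add the candidate whose errors are least correlated with the kept set;
        # RMSE and then id break ties so the result is deterministic.
        best = min(remaining, key=lambda m: (mean_corr(m), rmse(residuals[m]), m))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each step costs one pass over the remaining candidates, so choosing k of n models needs on the order of n·k² correlation computations rather than a search over all possible ensembles of k models. A real implementation would likely also weigh individual model accuracy at every step, not just when seeding.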

This is related to: https://0xdata.atlassian.net/browse/PUBDEV-4053

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: maybe we could also progressively save models to disk instead of keeping them in memory. This could be based on either of the following conditions:

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Yeah, we should have the option of writing them to disk vs. just deleting them.

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5146
Assignee: UNASSIGNED
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A