h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

AutoML: turn on parallel grid search by default #8427

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

We should do some benchmarks first, but [~accountid:5a32df017dcf343865c26fa5]'s benchmarks suggest that this is usually very helpful: https://github.com/Pscheidl/h2o-parallel-grid-search-benchmark

Once we establish that this is helpful in most common hardware/dataset scenarios, then we should turn on parallel grid search for all the grids in AutoML by default.

exalate-issue-sync[bot] commented 1 year ago

Juan Telleria commented: Would it be possible to add a "parallelism" argument to AutoML in the current version of h2o-3, 3.28 (similar to the one in Grid Search), so that parallelism can be enabled manually even if it is not enabled by default?

Thanks!
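For reference, the Grid Search argument mentioned above can be used roughly as follows from the Python client. This is a minimal sketch, assuming h2o 3.28 or later is installed and a cluster is already running; the helper name `train_parallel_gbm_grid` and the hyperparameter values are illustrative and not taken from this thread.

```python
# Illustrative hyperparameter space (values are assumptions, not from this thread).
HYPER_PARAMS = {"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1]}


def train_parallel_gbm_grid(train_frame, x, y, parallelism=2):
    """Train a GBM grid, asking H2O to build up to `parallelism` models at once.

    `train_frame` is expected to be an H2OFrame on a running cluster;
    `x` is the list of predictor column names and `y` the response column.
    """
    # Imported lazily so this module can be loaded without h2o installed.
    from h2o.estimators import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    grid = H2OGridSearch(
        model=H2OGradientBoostingEstimator(nfolds=5, seed=1),
        hyper_params=HYPER_PARAMS,
        parallelism=parallelism,  # grid-level parallelism, set explicitly here
    )
    grid.train(x=x, y=y, training_frame=train_frame)
    return grid
```

Calling `train_parallel_gbm_grid(frame, predictors, "response")` would then train the six GBM combinations with two model builds in flight at a time, each of them also running its own cross-validation.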

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: [~accountid:59f71b4f040e6b6cf1c0a904] Thanks for your suggestion, I’ll consider adding a `max_parallelism` param for AutoML.

However, I don’t think it’s reasonable to strictly enforce the level of parallelism – that’s why I suggest an upper limit instead – as there are differences between algorithms.

For example, on a single node with a GPU, we can’t train multiple XGBoost models in parallel.

On top of this, `AutoML` uses cross-validation by default and we’re already trying to train some of the CV models in parallel, so if we expose a `parallelism` argument with the same semantics as the one exposed in `GridSearch` today, it may not always behave as expected…
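As a rough illustration of why an upper limit fits better than a strict setting, the sketch below resolves a requested maximum into the parallelism actually used for one algorithm. It is not H2O code: the function `effective_parallelism` and its arguments are hypothetical, and the only rule it encodes is the single-node GPU XGBoost restriction mentioned above.

```python
def effective_parallelism(algo, max_parallelism, on_gpu=False, single_node=True):
    """Return how many models of `algo` may train at once (hypothetical helper).

    `max_parallelism` is treated as an upper bound, not a guarantee:
    some algorithm/hardware combinations force a lower value.
    """
    if max_parallelism < 1:
        raise ValueError("max_parallelism must be at least 1")
    # Single node + GPU: multiple XGBoost models cannot train in parallel.
    if algo == "xgboost" and on_gpu and single_node:
        return 1
    return max_parallelism
```

With this reading, `effective_parallelism("xgboost", 4, on_gpu=True)` falls back to sequential training, while `effective_parallelism("gbm", 4)` keeps the requested limit.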

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: Also note that, soon, parallelization probably won’t be restricted to grids…

exalate-issue-sync[bot] commented 1 year ago

Juan Telleria commented: Thanks for your insights and for taking it into consideration! :)

Best,

Juan

exalate-issue-sync[bot] commented 1 year ago

Juan Telleria commented: In AutoML we could have 2 kinds of parallelism:

So we have a bunch of possible combinations...

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: [~accountid:59f71b4f040e6b6cf1c0a904] you’re absolutely right, that’s why we have to test each of those independently, as too much parallelism can become harmful:

Currently, AutoML uses CV by default, and GBM+XGBoost train 2 CV models in parallel by default (only 1 for XGB if running on GPU).

On top of this, I’m about to run some benchmarks with parallelism activated for our grid searches: as they also train models using CV, we’re doubling the level of parallelism compared to a normal grid.

Finally, if those grid results are positive, we will also consider training the default models (XGB, DRF…) in parallel: the current implementation makes this easy to switch.

However, for the end user, we will expose just one parameter that tries to sum all of this up in one number.
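The compounding described in this comment can be made concrete with a little arithmetic. The sketch below is only an illustration: the names are hypothetical, the CV defaults for GBM and XGBoost come from the comment above, and the default of 1 for every other algorithm is an assumption.

```python
# CV models trained in parallel per model build (GBM and XGBoost values are
# from the comment above; anything else defaulting to 1 is an assumption).
CV_PARALLELISM = {"gbm": 2, "xgboost": 2}


def cv_parallelism(algo, on_gpu=False):
    """Number of CV models trained at once for a single model of `algo`."""
    if algo == "xgboost" and on_gpu:
        return 1  # only 1 CV model at a time for XGBoost on GPU
    return CV_PARALLELISM.get(algo, 1)


def concurrent_builds(algo, grid_parallelism, on_gpu=False):
    """Total model builds in flight: grid-level times CV-level parallelism."""
    return grid_parallelism * cv_parallelism(algo, on_gpu)
```

For GBM, a sequential grid already keeps 2 builds in flight because of CV, and a grid parallelism of 2 raises that to 4, which is why each level needs to be benchmarked separately before a single user-facing number is chosen.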

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7206

Assignee: Sebastien Poirier

Reporter: Erin LeDell

State: Open

Fix Version: Backlog

Attachments: N/A

Development PRs: N/A