h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Add XGBoost to AutoML #11391

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Add a diverse handful of XGBoost models (some shallow, some deeper, with different row/column sample rates, etc.) to the top of the AutoML queue, using parameters chosen by [~accountid:557058:8dd31304-4ae2-4b33-9c8f-131377035b71], [~accountid:557058:3402c6e3-c528-4a01-8b6b-85a92dd2a5f8] and [~accountid:557058:948d1d12-c9bb-4ce6-81b5-f7c9ecc76d88]. Also compare the ranges against this project: https://rdrr.io/github/ja-thomas/autoxgboost/man/autoxgbparset.html
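For illustration only, here is a hedged sketch of what such a "diverse handful" of hard-coded XGBoost parameter sets could look like in the h2o Python API; the actual values were to be chosen by the people mentioned above, so everything below is a placeholder, not the shipped configuration.

{code:python}
# Hypothetical example only -- not the parameter sets AutoML actually uses.
# Each dict could be passed to H2OXGBoostEstimator(**params).
diverse_xgb_params = [
    # shallow trees, aggressive row/column subsampling
    {"max_depth": 5,  "ntrees": 200, "learn_rate": 0.05,
     "sample_rate": 0.6, "col_sample_rate": 0.8, "min_rows": 5},
    # medium depth, milder subsampling
    {"max_depth": 10, "ntrees": 200, "learn_rate": 0.05,
     "sample_rate": 0.8, "col_sample_rate": 0.8, "min_rows": 10},
    # deep trees, strong column subsampling
    {"max_depth": 15, "ntrees": 200, "learn_rate": 0.05,
     "sample_rate": 0.9, "col_sample_rate": 0.5, "min_rows": 3},
]
{code}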

exalate-issue-sync[bot] commented 1 year ago

Michal Malohlava commented: Ping - any progress?

exalate-issue-sync[bot] commented 1 year ago

Navdeep commented: Not yet, waiting for Mateusz PR

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] I started to add a default XGBoost (with just the algo defaults) and a hyperparameter search to check that this works fine, with search params inspired by the GBM grid plus https://rdrr.io/github/ja-thomas/autoxgboost/man/autoxgbparset.html.

What is the usual approach to tuning the params for the default models?

- Ask the people mentioned in the ticket if they already have reasonable defaults to offer?
- Run multiple grid searches against various datasets and find patterns among the best ones?

Also, do we want to have an XGBoost grid in AutoML?
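For context on what "search params inspired by GBM grid + autoxgboost" might look like, here is a hedged sketch of such a search space; the ranges are illustrative only and not the ones that ended up in AutoML.

{code:python}
# Illustrative ranges only -- loosely modelled on the H2O GBM grid and the
# autoxgboost parameter set, not the final AutoML XGBoost grid.
xgb_hyper_params = {
    "max_depth":       [3, 5, 7, 9, 12, 15],
    "learn_rate":      [0.01, 0.05, 0.1, 0.3],
    "sample_rate":     [0.6, 0.8, 1.0],
    "col_sample_rate": [0.4, 0.7, 1.0],
    "min_rows":        [1, 5, 10, 30],
    "reg_lambda":      [0.0, 0.1, 1.0],
    "reg_alpha":       [0.0, 0.1, 1.0],
}
{code}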

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: [~accountid:5b153fb1b0d76456f36daced] I think the default XGBoost is not great, so we may choose to skip it, or at least to increase ntrees from 50 to something much higher and use early stopping. The default learn_rate/eta of 0.3 is not great either.

We may want to do what we did with H2O GBM, which is hard-code a few "good" models and then do a grid after. Let's discuss internally what we want the "good" model to be.
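As a rough sketch of the suggestion above in the h2o Python API (the values here are placeholders, not the agreed-upon "good" models): many trees capped by early stopping instead of the default ntrees=50, and a learn_rate lower than the default 0.3.

{code:python}
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator

h2o.init()

# Placeholder values for illustration only.
xgb = H2OXGBoostEstimator(
    ntrees=10000,              # large cap; early stopping decides the real size
    learn_rate=0.05,           # lower than the 0.3 default
    score_tree_interval=5,
    stopping_rounds=3,
    stopping_metric="logloss",
    stopping_tolerance=1e-3,
)
# Assumes pre-existing predictors list and train/valid frames:
# xgb.train(x=predictors, y="signal", training_frame=train, validation_frame=valid)
{code}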

exalate-issue-sync[bot] commented 1 year ago

Peter Zbornik commented: Some suggestions:

Here is a very simple approach to parameter tuning in xgboost based on best-practice recommendations: https://github.com/SylwiaOliwia2/xgboost-AutoTune

Hyperband seems to be an improved random-search strategy that reportedly beats Bayesian optimization: https://people.eecs.berkeley.edu/~kjamieson/hyperband.html Here's the paper: https://arxiv.org/pdf/1603.06560.pdf

Here is an experimental implementation for neural networks with PyTorch: https://github.com/kevinzakka/hypersearch and here is one that covers an XGBoost classifier, among others: https://github.com/zygmuntz/hyperband
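Hyperband was not available in H2O at the time, so purely for reference, here is a minimal, library-agnostic sketch of the successive-halving loop at the core of Hyperband; the sample_params and train_and_score(params, budget) callables are assumptions the caller would supply (budget could be, say, the number of trees), and a full Hyperband run would repeat this over several brackets that trade off n against the starting budget.

{code:python}
import math

def successive_halving(sample_params, train_and_score, n=27, max_budget=81.0, eta=3):
    """One Hyperband-style bracket: evaluate n random configs on a small budget,
    keep the best 1/eta at each rung, and grow the budget by a factor of eta."""
    rungs = round(math.log(n, eta))
    budget = max_budget / (eta ** rungs)             # smallest rung budget
    configs = [sample_params() for _ in range(n)]
    best = None
    while configs:
        scored = sorted(((train_and_score(c, budget), c) for c in configs),
                        key=lambda sc: sc[0], reverse=True)  # higher score = better
        best = scored[0]
        if len(configs) == 1 or budget >= max_budget:
            break
        configs = [c for _, c in scored[: max(1, len(configs) // eta)]]
        budget = min(budget * eta, max_budget)
    return best  # (score, params) of the surviving configuration
{code}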

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: [~accountid:5b7bc106eb18d0589b9b1ba5] Thanks for the recommendations. At this point, the optimization strategy we use inside AutoML for all models (including the new XGBoost models) is random search, since that's already implemented in H2O and works decently well with stacked ensembles. In the future, we hope to expand to other strategies such as Bayesian optimization and Hyperband once that functionality has been added to H2O.
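For readers following along, H2O's random search is exposed through grid search with a "RandomDiscrete" search strategy; below is a minimal sketch, assuming a search-space dict like the one sketched earlier plus pre-existing train/valid frames, a predictors list, and a "signal" response column.

{code:python}
from h2o.estimators.xgboost import H2OXGBoostEstimator
from h2o.grid.grid_search import H2OGridSearch

# Assumes: xgb_hyper_params (search space), predictors, train, valid.
grid = H2OGridSearch(
    model=H2OXGBoostEstimator(ntrees=200, seed=1),
    hyper_params=xgb_hyper_params,
    search_criteria={"strategy": "RandomDiscrete", "max_models": 20, "seed": 1},
)
grid.train(x=predictors, y="signal", training_frame=train, validation_frame=valid)
print(grid.get_grid(sort_by="auc", decreasing=True))
{code}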

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] just to keep track of this, I tried AutoML on the MiniBoone dataset: https://www.openml.org/d/41150.

The dataset is first split 70 | 15 | 15 and the leaderboard is sorted using validation metrics. The test prediction metrics are provided separately after the leaderboard (pasting only the top rows here).
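For reference, the run above roughly corresponds to the following h2o Python calls; the file path, seed, max_models and excluded algos are assumptions for illustration, not the exact script that was used.

{code:python}
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Assumed local CSV export of the OpenML MiniBooNE dataset; "signal" is the response.
df = h2o.import_file("miniboone.csv")
df["signal"] = df["signal"].asfactor()
train, valid, test = df.split_frame(ratios=[0.7, 0.15], seed=1)

aml = H2OAutoML(max_models=20, seed=1, exclude_algos=["DeepLearning"])
aml.train(y="signal", training_frame=train, validation_frame=valid)

print(aml.leaderboard)                                # validation/CV-based leaderboard
print(aml.leader.model_performance(test_data=test))   # held-out test metrics
{code}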

{noformat}
@@@@@@@ training dataset miniboone [binomial_bigger] @@@@@@@
dataset dimensions: full:[130064, 51], train:[90948, 51], validation:[19601, 51], test:[19515, 51]

11.09.2018 00:18:52: starting automl_binomial_bigger training

AutoML progress: |████████████████████████████████████████████████████████████████████████████| 100%

11.09.2018 05:04:19: results for automl_binomial_bigger training

model_id                                                auc       logloss   mean_per_class_error  rmse      mse
XGBoost_grid_0_AutoML_20180911_001852_model_4           0.985963  0.135552  0.0574915             0.199697  0.0398789
XGBoost_grid_0_AutoML_20180911_001852_model_3           0.98593   0.135845  0.0594074             0.200643  0.0402576
XGBoost_2_AutoML_20180911_001852                        0.985789  0.136418  0.0597035             0.200489  0.0401959
XGBoost_grid_0_AutoML_20180911_001852_model_0           0.985783  0.136301  0.0609206             0.200359  0.0401436
XGBoost_1_AutoML_20180911_001852                        0.985561  0.137563  0.061812              0.201449  0.0405818
StackedEnsemble_AllModels_0_AutoML_20180911_001852      0.985417  0.149032  0.0589632             0.204151  0.0416778
StackedEnsemble_BestOfFamily_0_AutoML_20180911_001852   0.985136  0.149864  0.058806              0.204773  0.0419322
GBM_4_AutoML_20180911_001852                            0.984954  0.140372  0.0603351             0.203335  0.0413453
XGBoost_grid_0_AutoML_20180911_001852_model_2           0.984932  0.140684  0.06038               0.203879  0.0415665
XGBoost_0_AutoML_20180911_001852                        0.984904  0.140607  0.0622329             0.203763  0.0415195

[21 rows x 6 columns]
{noformat}

test prediction metrics:

{noformat}
test performance for leader #1: XGBoost_grid_0_AutoML_20180911_001852_model_4
MSE: 0.0382681405044  RMSE: 0.195622443765  LogLoss: 0.130875011799  Mean Per-Class Error: 0.0524245402525  AUC: 0.986817877393

test performance for leader #2: XGBoost_grid_0_AutoML_20180911_001852_model_3
MSE: 0.038604090037  RMSE: 0.196479235638  LogLoss: 0.132357706773  Mean Per-Class Error: 0.0520469771935  AUC: 0.98641021684

test performance for leader #3: XGBoost_2_AutoML_20180911_001852
MSE: 0.0387343768915  RMSE: 0.196810510114  LogLoss: 0.13187766764  Mean Per-Class Error: 0.0529061457167  AUC: 0.98668383515

test performance for leader #4: XGBoost_grid_0_AutoML_20180911_001852_model_0
MSE: 0.0385873085579  RMSE: 0.196436525519  LogLoss: 0.1318878718  Mean Per-Class Error: 0.0528822851689  AUC: 0.986606654579

test performance for leader #5: XGBoost_1_AutoML_20180911_001852
MSE: 0.0389238078057  RMSE: 0.197291175184  LogLoss: 0.133065922986  Mean Per-Class Error: 0.0539166806076  AUC: 0.986385588068

test performance for leader #6: StackedEnsemble_AllModels_0_AutoML_20180911_001852
MSE: 0.0398288909507  RMSE: 0.199571768922  LogLoss: 0.143207242521  Mean Per-Class Error: 0.0530746539185  AUC: 0.986353055692

test performance for leader #7: StackedEnsemble_BestOfFamily_0_AutoML_20180911_001852
MSE: 0.0400433958886  RMSE: 0.200108460312  LogLoss: 0.143902401057  Mean Per-Class Error: 0.0535062596536  AUC: 0.986146038584

test performance for leader #8: GBM_4_AutoML_20180911_001852
MSE: 0.0393641928007  RMSE: 0.198404114878  LogLoss: 0.134642163867  Mean Per-Class Error: 0.0535170799075  AUC: 0.986007146107
{noformat}

and the execution plan:

{noformat}
00:18:52.934 Info Workflow AutoML job created: 2018.09.11 00:18:52.933
00:18:52.934 Info DataImport Training and validation were both specified; no auto-splitting.
00:18:52.934 Info DataImport Leaderboard frame not provided by the user; leaderboard will use cross-validation metrics instead.
00:18:52.934 Info DataImport training frame: Frame key: _9dde6164e3835c68ff416822f21e4166 cols: 51 rows: 90948 chunks: 32 size: 36505286 checksum: -1075665448297511166
00:18:52.934 Info DataImport validation frame: Frame key: py_32_sid_8108 cols: 51 rows: 19601 chunks: 32 size: 7899462 checksum: -6351752454184487366
00:18:52.934 Info DataImport leaderboard frame: NULL
00:18:52.934 Info DataImport response column: signal
00:18:52.934 Info DataImport fold column: null
00:18:52.934 Info DataImport weights column: null
00:18:52.934 Info Workflow Build control seed: 130088942
00:18:52.935 Info Workflow Setting stopping tolerance adaptively based on the training frame: 0.003315915260401258
00:18:52.935 Info Workflow Project: automl_binomial_bigger
00:18:52.935 Info ModelTraining Disabling Algo: DeepLearning as requested by the user.
00:18:52.935 Info ModelTraining Disabling Algo: LightGBM as requested by the user.
00:18:52.936 Info Workflow AutoML build started: 2018.09.11 00:18:52.935
00:18:52.946 Info ModelTraining Default Random Forest build started
00:19:59.90 Info ModelTraining Default Random Forest build complete
00:19:59.91 Info ModelTraining New leader: DRF_0_AutoML_20180911_001852, auc: 0.9778945995151945
00:19:59.100 Info ModelTraining Extremely Randomized Trees (XRT) Random Forest build started
00:21:10.240 Info ModelTraining Extremely Randomized Trees (XRT) Random Forest build complete
00:21:10.241 Info ModelTraining AutoML: starting GLM hyperparameter search
00:21:10.242 Info ModelTraining GLM hyperparameter search started
00:21:32.287 Info ModelTraining Built: 1 models for search: GLM hyperparameter search
00:21:32.288 Info ModelTraining GLM hyperparameter search complete
00:21:32.289 Info ModelTraining XGBoost_0_AutoML_20180911_001852 started
00:34:04.613 Info ModelTraining XGBoost_0_AutoML_20180911_001852 complete
00:34:04.614 Info ModelTraining New leader: XGBoost_0_AutoML_20180911_001852, auc: 0.984904476772391
00:34:04.616 Info ModelTraining XGBoost_1_AutoML_20180911_001852 started
00:44:46.197 Info ModelTraining XGBoost_1_AutoML_20180911_001852 complete
00:44:46.198 Info ModelTraining New leader: XGBoost_1_AutoML_20180911_001852, auc: 0.985561079584663
00:44:46.200 Info ModelTraining XGBoost_2_AutoML_20180911_001852 started
00:59:04.984 Info ModelTraining XGBoost_2_AutoML_20180911_001852 complete
00:59:04.985 Info ModelTraining New leader: XGBoost_2_AutoML_20180911_001852, auc: 0.9857891845404257
00:59:04.987 Info ModelTraining GBM 1 started
01:00:26.305 Info ModelTraining GBM 1 complete
01:00:26.308 Info ModelTraining GBM 2 started
01:01:53.665 Info ModelTraining GBM 2 complete
01:01:53.669 Info ModelTraining GBM 3 started
01:03:27.982 Info ModelTraining GBM 3 complete
01:03:27.985 Info ModelTraining GBM 4 started
01:04:52.298 Info ModelTraining GBM 4 complete
01:04:52.301 Info ModelTraining GBM 5 started
01:06:37.661 Info ModelTraining GBM 5 complete
01:06:37.664 Info ModelTraining AutoML: starting XGBoost hyperparameter search
01:06:37.665 Info ModelTraining XGBoost hyperparameter search started
03:25:34.312 Info ModelTraining Built: 1 models for search: XGBoost hyperparameter search
03:38:47.229 Info ModelTraining Built: 2 models for search: XGBoost hyperparameter search
04:03:09.722 Info ModelTraining Built: 3 models for search: XGBoost hyperparameter search
04:21:13.679 Info ModelTraining Built: 4 models for search: XGBoost hyperparameter search
04:21:13.681 Info ModelTraining New leader: XGBoost_grid_0_AutoML_20180911_001852_model_3, auc: 0.9859303982152101
04:38:20.448 Info ModelTraining Built: 5 models for search: XGBoost hyperparameter search
04:38:20.454 Info ModelTraining New leader: XGBoost_grid_0_AutoML_20180911_001852_model_4, auc: 0.9859633194719469
04:38:20.454 Info ModelTraining XGBoost hyperparameter search complete
04:38:20.454 Info ModelTraining AutoML: starting GBM hyperparameter search
04:38:20.461 Info ModelTraining GBM hyperparameter search started
04:38:58.551 Info ModelTraining Built: 1 models for search: GBM hyperparameter search
05:02:43.284 Info ModelTraining Built: 2 models for search: GBM hyperparameter search
05:02:57.329 Info ModelTraining Built: 3 models for search: GBM hyperparameter search
05:02:57.332 Info ModelTraining GBM hyperparameter search complete
05:02:57.344 Info ModelTraining StackedEnsemble build using all AutoML models started
05:04:08.84 Info ModelTraining StackedEnsemble build using all AutoML models complete
05:04:08.89 Info ModelTraining StackedEnsemble build using top model from each algorithm type started
05:04:19.148 Info ModelTraining StackedEnsemble build using top model from each algorithm type complete
05:04:19.152 Info Workflow AutoML: build done; built 21 models
{noformat}

As we can see, each XGBoost model easily takes at least 10-20 min (even ~2 h for the first model of the XGBoost grid, most likely because {{XGBoost_grid_0_AutoML_20180911_001852_model_0}} picked min_rows=min_child_weight=0.01), for performance only slightly better than the GBM models.

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: XGBoost default parameters or ranges used in various AutoML projects

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] please review the performance and let us know if we can keep it ON

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: [~accountid:5b153fb1b0d76456f36daced] Can you add some documentation to the Description of this ticket about which XGBoost models are trained (how many default XGBoost models, with what params, what the grid space is, and in what order)? There is a description in the PR, but I think it might be out of date since it mentions LightGBM (please update it there too): https://github.com/h2oai/h2o-3/pull/2828#issue-214414249

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4507
Assignee: Erin LeDell
Reporter: Erin LeDell
State: Closed
Fix Version: 3.22.0.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/2828