h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.92k stars, 2k forks

AutoML XGBoost score_tree_interval = 5 causes really bad performance on big datasets #9149

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Hi, I noticed that the performance of XGBoost-related models via AutoML is really bad for a dataset that I have, composed of ~15M rows and 126 predictors.

I noticed that in the logs:

05-09 15:25:49.075 127.0.0.1:54321 27825 FJ-3-57 INFO: 46. tree was built in 00:00:02.388 (Wall: 09-May 15:25:49.075)
05-09 15:25:51.650 127.0.0.1:54321 27825 FJ-3-57 INFO: 47. tree was built in 00:00:02.575 (Wall: 09-May 15:25:51.650)
05-09 15:25:54.086 127.0.0.1:54321 27825 FJ-3-57 INFO: 48. tree was built in 00:00:02.435 (Wall: 09-May 15:25:54.085)
05-09 15:25:56.666 127.0.0.1:54321 27825 FJ-3-57 INFO: 49. tree was built in 00:00:02.580 (Wall: 09-May 15:25:56.666)
05-09 15:25:59.205 127.0.0.1:54321 27825 FJ-3-57 INFO: 50. tree was built in 00:00:02.539 (Wall: 09-May 15:25:59.205)

[... scoring round, lasts 60s ...]

05-09 15:27:02.595 127.0.0.1:54321 27825 FJ-3-57 INFO: 51. tree was built in 00:00:02.361 (Wall: 09-May 15:27:02.595)

That is, a tree is built every ~2.5 seconds. The problem is that every 5 trees a new scoring round is launched, and it lasts ~60 seconds. So for every ~12.5 seconds (5 trees × ~2.5 s) spent building trees, ~60 seconds are spent scoring.
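The build and scoring times above can be checked directly from the log. The following is a minimal sketch, not part of H2O: the helper names (`parse_trees`, `idle_gaps`) are made up for illustration, and it assumes all lines fall on the same day and that any wall-clock time between two "tree was built" lines that is not explained by building the later tree was spent scoring.

```python
import re

# Log lines copied from the report above (trees 46-51).
LOG = """\
05-09 15:25:49.075 127.0.0.1:54321 27825 FJ-3-57 INFO: 46. tree was built in 00:00:02.388 (Wall: 09-May 15:25:49.075)
05-09 15:25:51.650 127.0.0.1:54321 27825 FJ-3-57 INFO: 47. tree was built in 00:00:02.575 (Wall: 09-May 15:25:51.650)
05-09 15:25:54.086 127.0.0.1:54321 27825 FJ-3-57 INFO: 48. tree was built in 00:00:02.435 (Wall: 09-May 15:25:54.085)
05-09 15:25:56.666 127.0.0.1:54321 27825 FJ-3-57 INFO: 49. tree was built in 00:00:02.580 (Wall: 09-May 15:25:56.666)
05-09 15:25:59.205 127.0.0.1:54321 27825 FJ-3-57 INFO: 50. tree was built in 00:00:02.539 (Wall: 09-May 15:25:59.205)
05-09 15:27:02.595 127.0.0.1:54321 27825 FJ-3-57 INFO: 51. tree was built in 00:00:02.361 (Wall: 09-May 15:27:02.595)
"""

# Captures: tree number, build duration (h, m, s, ms), wall time of day (h, m, s, ms).
LINE = re.compile(
    r"INFO: (\d+)\. tree was built in (\d+):(\d+):(\d+)\.(\d+) "
    r"\(Wall: \S+ (\d+):(\d+):(\d+)\.(\d+)\s*\)"
)


def to_ms(h, m, s, ms):
    """Convert hour/minute/second/millisecond strings to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_trees(log):
    """Return a list of (tree_number, build_ms, wall_ms) tuples."""
    trees = []
    for match in LINE.finditer(log):
        g = match.groups()
        trees.append((int(g[0]), to_ms(*g[1:5]), to_ms(*g[5:9])))
    return trees


def idle_gaps(trees):
    """Map tree number -> ms since the previous tree that was NOT spent building it."""
    return {
        cur[0]: (cur[2] - prev[2]) - cur[1]
        for prev, cur in zip(trees, trees[1:])
    }


trees = parse_trees(LOG)
build_ms = sum(b for n, b, _ in trees if 46 <= n <= 50)  # trees 46..50
gaps = idle_gaps(trees)
scoring_ms = gaps[51]  # unexplained time before tree 51, attributed to scoring

print(f"building 5 trees: {build_ms / 1000:.1f}s, gap before tree 51: {scoring_ms / 1000:.1f}s")
```

On these six lines the five trees account for about 12.5 s of building, while the gap before tree 51 is about 61 s, consistent with the ~60 s scoring round noted in the log excerpt.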

I think that score_tree_interval MUST be configurable for AutoML too, or at least increased tenfold.

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: [~accountid:557058:0dc6b079-bb4a-4a21-9e26-e07d6b8241eb] thanks for reporting, we are looking into ways to speed up the scoring to an extent that it won't be slowing down AutoML training like this. This is an issue that will be addressed.

Can you please share the full log? What OS are you using?

We have a bug on OS X where scoring is very slow; we are fixing it in the next release: PUBDEV-6476

exalate-issue-sync[bot] commented 1 year ago

Alessandro P. commented: I have added some logs. I run H2O on a Linux machine with 250 GB RAM and 24 threads.

From these 2 lines you can infer that tree construction takes 5 * 2.5s = 12.5s for every scoring session, which lasts ~50s.

05-16 09:25:43.326 1.2.3.4:54321 #6968 FJ-3-15 INFO: 5. tree was built in 00:00:02.467 (Wall: 16-May 09:25:43.326)
[... scoring session, included in log file attached... ]
05-16 09:26:31.479 1.2.3.4:54321 #6968 FJ-3-15 INFO: 6. tree was built in 00:00:02.331 (Wall: 16-May 09:26:31.479)

I think that

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: [~accountid:557058:0dc6b079-bb4a-4a21-9e26-e07d6b8241eb], thanks for the logs, this will help - it seems to me this could be related to scoring on GPUs.

exalate-issue-sync[bot] commented 1 year ago

Alessandro P. commented: Hi Michal, thanks for the rapid feedback!

I would add that the same happens, for example, during the iteration for the XGBoost_2 flavor: its max_depth = 20 is incompatible with GPU execution, so it runs on CPU.

The only difference is that every tree takes ~7 seconds on this training dataset, so the ratio learning/scoring may be different.

The main point here is that the ratio learning/scoring is too unbalanced towards the scoring step.

Even if scoring takes 1 second and building 5 trees takes 0.2 seconds (the timing with a small dataset), scoring time is still 5 to 10 times larger than "learning" time. In my opinion this should not happen, as it implies suboptimal resource usage.

What do you think about it?
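To make the imbalance concrete, here is a rough back-of-the-envelope model, not a measurement and not H2O code. The function name `scoring_fraction` is hypothetical, and it assumes each scoring round costs a fixed time regardless of how many trees have been built, which may not hold in practice.

```python
def scoring_fraction(tree_s, score_s, interval):
    """Fraction of wall time spent scoring in one build-then-score cycle.

    tree_s   -- seconds to build one tree
    score_s  -- seconds for one scoring round (assumed constant)
    interval -- trees built between scoring rounds (score_tree_interval)
    """
    cycle_s = interval * tree_s + score_s
    return score_s / cycle_s


# Timings from this thread: ~2.5 s per tree, ~60 s per scoring round.
for interval in (5, 50):
    print(interval, round(scoring_fraction(2.5, 60.0, interval), 2))

# Small-dataset case from the comment: 5 trees in 0.2 s, 1 s of scoring.
print(round(scoring_fraction(0.2 / 5, 1.0, 5), 2))
```

Under this simplified model, an interval of 5 leaves roughly 83% of the time in scoring for both the large and the small dataset, while a tenfold larger interval brings the large-dataset figure down to roughly a third.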

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6480
Assignee: Michal Kurka
Reporter: Alessandro P.
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: log_h2o_score_tree_interval5.txt
Attached By: Alessandro P.
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6480/log_h2o_score_tree_interval5.txt