Open exalate-issue-sync[bot] opened 1 year ago
Michal Kurka commented: [~accountid:557058:0dc6b079-bb4a-4a21-9e26-e07d6b8241eb] thanks for reporting; we are looking into ways to speed up scoring so that it no longer slows down AutoML training like this. This issue will be addressed.
Can you please share the full log? What OS are you using?
We have a bug on OS X where scoring is very slow, which we are fixing in the next release: PUBDEV-6476
Alessandro P. commented: I have added some logs. I run H2O on a Linux machine with 250 GB RAM and 24 threads.
From those two lines you can infer that tree construction takes 5 * 2.5s = 12.5s for every scoring session, which lasts ~50s.
05-16 09:25:43.326 1.2.3.4:54321 #6968 FJ-3-15 INFO: 5. tree was built in 00:00:02.467 (Wall: 16-May 09:25:43.326)
[... scoring session, included in the attached log file ...]
05-16 09:26:31.479 1.2.3.4:54321 #6968 FJ-3-15 INFO: 6. tree was built in 00:00:02.331 (Wall: 16-May 09:26:31.479)
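The arithmetic can be made explicit with a quick back-of-the-envelope check (a sketch using only the numbers quoted in this comment; nothing here is measured):

```python
# Rough check of the timings quoted above (illustrative only).
trees_per_scoring_round = 5    # score_tree_interval used in this run
seconds_per_tree = 2.5         # ~2.5 s per tree, from the log excerpt
scoring_round_seconds = 50.0   # duration of one scoring session

build_time = trees_per_scoring_round * seconds_per_tree  # seconds of learning per round
overhead = scoring_round_seconds / build_time            # scoring time per unit of learning

print(f"building: {build_time:.1f}s, scoring: {scoring_round_seconds:.0f}s "
      f"-> scoring takes {overhead:.1f}x longer than building")
```

So each scoring session costs roughly four times as much wall time as the tree building it interrupts.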
Michal Kurka commented: [~accountid:557058:0dc6b079-bb4a-4a21-9e26-e07d6b8241eb], thanks for the logs, this will help; it seems to me this could be related to scoring on GPUs.
Alessandro P. commented: Hi Michal, thanks for the rapid feedback!
I would add that the same happens, for example, during the iteration for the XGBoost_2 flavor, whose max_depth = 20 is incompatible with GPU execution, so it falls back to the CPU.
The only difference is that every tree takes ~7 seconds on this training dataset, so the learning/scoring ratio may be different.
The main point here is that the learning/scoring ratio is too heavily skewed towards the scoring step.
Even if scoring takes 1 second and building 5 trees takes 0.2 seconds (the timing with a small dataset), scoring time is still 5-10 times larger than "learning" time. In my opinion, this is something that should not happen, as it implies non-optimized resource usage.
What do you think?
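The fixed-overhead point above can be illustrated with the small-dataset timings just quoted (1 s per scoring round vs. 0.2 s to build 5 trees; these are the numbers from this comment, not a benchmark):

```python
# Fixed-overhead illustration: even when tree building is nearly free,
# each scoring round still costs a roughly constant amount of time.
build_5_trees = 0.2   # seconds to build 5 trees on a small dataset
scoring_round = 1.0   # seconds spent in one scoring round

ratio = scoring_round / build_5_trees
scoring_share = scoring_round / (scoring_round + build_5_trees)

print(f"scoring is {ratio:.0f}x the build time "
      f"({scoring_share:.0%} of wall time spent scoring)")
```

Under these assumptions roughly 83% of the wall time goes to scoring rather than learning.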
JIRA Issue Migration Info
Jira Issue: PUBDEV-6480
Assignee: Michal Kurka
Reporter: Alessandro P.
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A
Attachments From Jira
Attachment Name: log_h2o_score_tree_interval5.txt
Attached By: Alessandro P.
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6480/log_h2o_score_tree_interval5.txt
Hi, I noticed that performance for XGBoost-related models via AutoML is really bad on a dataset I have, composed of ~15M rows and 126 predictors.
I noticed that in the logs:
05-09 15:25:49.075 127.0.0.1:54321 27825 FJ-3-57 INFO: 46. tree was built in 00:00:02.388 (Wall: 09-May 15:25:49.075)
05-09 15:25:51.650 127.0.0.1:54321 27825 FJ-3-57 INFO: 47. tree was built in 00:00:02.575 (Wall: 09-May 15:25:51.650)
05-09 15:25:54.086 127.0.0.1:54321 27825 FJ-3-57 INFO: 48. tree was built in 00:00:02.435 (Wall: 09-May 15:25:54.085)
05-09 15:25:56.666 127.0.0.1:54321 27825 FJ-3-57 INFO: 49. tree was built in 00:00:02.580 (Wall: 09-May 15:25:56.666)
05-09 15:25:59.205 127.0.0.1:54321 27825 FJ-3-57 INFO: 50. tree was built in 00:00:02.539 (Wall: 09-May 15:25:59.205)
[... scoring round, lasts 60s ...]
05-09 15:27:02.595 127.0.0.1:54321 27825 FJ-3-57 INFO: 51. tree was built in 00:00:02.361 (Wall: 09-May 15:27:02.595)
That is, a tree is built every ~2.5 seconds. The problem is that every 5 trees, a new scoring round is launched that lasts ~60 seconds. So, for every ~12.5 seconds spent building trees (5 trees at ~2.5 s each), 60 s are spent scoring.
I think that score_tree_interval MUST be configurable for AutoML too, or at least increased by a factor of 10.
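To show why, here is a sketch of the amortization effect for a hypothetical 50-tree run. The `total_seconds` helper is mine, and the model assumes the cost per scoring round stays roughly constant at the ~60 s reported in this issue:

```python
# Hypothetical model of total wall time for a 50-tree XGBoost run,
# assuming ~2.5 s per tree and a fixed ~60 s per scoring round
# (numbers taken from the log excerpt in this issue).
SECONDS_PER_TREE = 2.5
SCORING_ROUND_SECONDS = 60.0
N_TREES = 50

def total_seconds(score_tree_interval):
    """Time to build N_TREES trees, scoring once every `score_tree_interval` trees."""
    scoring_rounds = N_TREES // score_tree_interval
    return N_TREES * SECONDS_PER_TREE + scoring_rounds * SCORING_ROUND_SECONDS

for interval in (5, 10, 50):
    print(f"score_tree_interval={interval:>2}: {total_seconds(interval):.0f}s total")
```

With the current interval of 5, this model spends 600 s scoring against 125 s building; raising the interval tenfold, as suggested above, would cut the total from ~725 s to ~185 s under these assumptions.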