Nick Lavers commented: I repeated the baseline experiment using the most recent H2O version, 3.24.0.5 (32 GB driver, max_depth=30). The results were very similar, indicating that the issue persists:
data_real: Failed 40/40 times (100.0%)
data_fake: Failed 1/29 times (3.4%)
data_fake_d: Failed 10/30 times (33.3%)
Nick Lavers commented: I again repeated the baseline experiment using the most recent H2O version, 3.26.0.2 (32 GB driver, max_depth=30). The results were essentially the same:
data_real: Failed 25/25 times (100.0%)
data_fake: Failed 0/38 times (0.0%)
data_fake_d: Failed 10/36 times (27.8%)
JIRA Issue Migration Info
Jira Issue: PUBDEV-6618
Assignee: New H2O Bugs
Reporter: Nick Lavers
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
XGBoost training with PySparkling fails on some datasets and succeeds on other, very similar datasets. The H2O context crashes, and then the Spark executors die with ConnectionRefusedErrors. The failure appears to be non-deterministic. GBM does not share this problem.
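For context, the training path involved is roughly of the following shape. This is a minimal sketch assuming the standard PySparkling / h2o-py APIs; the parquet path, target column, and ntrees value are placeholders rather than my exact settings.

```python
# Minimal sketch of the training call described above (placeholder path,
# target column, and ntrees; max_depth=30 is the failing configuration).
from pyspark.sql import SparkSession
from pysparkling import H2OContext
from h2o.estimators.xgboost import H2OXGBoostEstimator

spark = SparkSession.builder.appName("xgboost-repro").getOrCreate()
hc = H2OContext.getOrCreate(spark)

df = spark.read.parquet("hdfs:///path/to/data_real")   # placeholder path
frame = hc.as_h2o_frame(df, "data_real")               # asH2OFrame() on newer Sparkling Water

target = "label"                                        # placeholder target column
features = [c for c in frame.columns if c != target]

model = H2OXGBoostEstimator(ntrees=100, max_depth=30)
model.train(x=features, y=target, training_frame=frame)
```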
I have three parquet datasets, each with 5,318,396 rows, 414 columns, 60 partitions, and the same schema, but with very different observed failure rates. The columns have been renamed. Their schema is as follows:
I have uploaded my datasets to a Google Drive here: https://drive.google.com/drive/folders/1w1axJE9B-XzBOAxLil0sApQdXRQ7G8t5?usp=sharing
The datasets are:
Also in the drive are:
Above is all of the key information. Below I list further details, in order of decreasing expected relevance:
I’m using H2O version 3.24.0.3. This was the latest version of H2O when I began my experiments, and I have not yet redone them with the most up-to-date version. I have looked at the change logs and it does not appear that my issue has been fixed, though I do not have sufficient expertise to know that for sure.
My Spark version is 2.2.1, though I have tried 2.3 and 2.4 without noticeable difference in outcome.
Experiment design:
My definition of “Failure” is that training takes more than an hour to complete. With my setup, training on any dataset either succeeds in about twenty minutes or stalls out, seemingly unable to train any additional trees. The Spark executors die one at a time, and eventually I have to come in and kill the job manually.

Because of the non-deterministic nature of the failure, it is very difficult to conclusively eliminate any variables as relevant, especially since I am working on a Spark cluster that is shared with many other developers in my department. For this reason, I wrote a bash script that runs the training code in a loop with a one-hour timeout, choosing one of the three datasets at random and recording the result (success or timeout); a rough sketch of that harness is given below.

This means that it is possible, though unlikely, that some of the runs marked as failures would in fact have succeeded if given more than one hour. With my normal 32 GB driver, the longest duration of a recorded success was under 30 minutes. With the larger 128 GB driver, there was one recorded success at 52 minutes (on data_fake_d), and the rest were under 35 minutes.
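The harness itself was a bash loop; an equivalent sketch in Python (the script name, log path, and spark-submit invocation are hypothetical) would look roughly like this:

```python
# Rough Python equivalent of the bash harness: loop forever, pick one of the
# three datasets at random, run the training job with a one-hour timeout, and
# record the outcome. "train_xgb.py" and "results.log" are hypothetical names.
import random
import subprocess

DATASETS = ["data_real", "data_fake", "data_fake_d"]

while True:
    dataset = random.choice(DATASETS)
    try:
        subprocess.run(["spark-submit", "train_xgb.py", dataset],
                       timeout=3600, check=True)   # one hour, per the failure definition
        outcome = "success"
    except subprocess.TimeoutExpired:
        outcome = "timeout"                        # counted as a failure
    except subprocess.CalledProcessError:
        outcome = "error"
    with open("results.log", "a") as log:
        log.write("{}\t{}\n".format(dataset, outcome))
```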
Normal observed failure rates (32 GB driver, max_depth=30):
data_real: Failed 46/46 times (100.0%)
data_fake: Failed 1/45 times (2.2%)
data_fake_d: Failed 14/68 times (20.6%)
I repeated these experiments with a much larger H2O driver process (128 GB instead of my normal 32 GB). This had an impact on the failure rate that is difficult for me to interpret:
Large driver observed failure rates (128 GB driver, max_depth=30):
data_real: Failed 36/40 times (90.0%)
data_fake: Failed 0/26 times (0.0%)
data_fake_d: Failed 11/30 times (36.7%)
Setting max_depth to 5 instead of 30 causes all three datasets to always succeed, usually in about ten minutes:
Shallow trees observed failure rates (32 GB driver, max_depth=5):
data_real: Failed 0/11 times (0.0%)
data_fake: Failed 0/5 times (0.0%)
data_fake_d: Failed 0/10 times (0.0%)
All of these experimental results are subject to the caveat that the bug is non-deterministic, so any experimental run has the potential to be a fluke.
Spark Settings
Cluster Details
None of the features are constructed directly from other features, with the exception of d_0, which is defined as (d_12 OR d_25 OR d_36 OR d_55 OR d_69). This definition is not used and does not hold in the ‘data_fake’ or ‘data_fake_d’ datasets.
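For completeness, this relationship can be checked with a few lines of PySpark. The snippet below is illustrative only and was not part of the original experiments; it assumes the d_* columns are boolean, and the path is a placeholder.

```python
# Illustrative check of the d_0 definition: count rows where d_0 disagrees
# with (d_12 OR d_25 OR d_36 OR d_55 OR d_69). Assumes boolean d_* columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("d0-check").getOrCreate()
df = spark.read.parquet("hdfs:///path/to/data_real")   # placeholder path

derived = (F.col("d_12") | F.col("d_25") | F.col("d_36")
           | F.col("d_55") | F.col("d_69"))
mismatches = df.filter(F.col("d_0") != derived).count()
print("rows where d_0 differs from its definition:", mismatches)
# Expected: 0 for data_real; nonzero for data_fake and data_fake_d.
```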
Overall data size (number of rows) seems to have a slight impact on failure chance, with larger data more likely to fail, but I have not done sufficient experimentation to make this observation rigorous.