Deep-Tree XGBoost with PySparkling Fails Non-Deterministically on Certain Datasets

exalate-issue-sync[bot] commented 1 year ago

XGBoost training with PySparkling fails on some datasets and succeeds on other very similar datasets. The H2O context crashes, then the Spark executors die of ConnectionRefusedErrors. The failure seems to be non-deterministic. GBM does not share this problem.

I have three parquet datasets, each with 5,318,396 rows, 414 columns, 60 partitions, and the same schema, but with three very different observed failure rates. The columns have been renamed. Their schema is as follows:

Y: the dependent variable, a Double.
A-E: 410 Boolean features, organized into five groups based on their real-world meanings. Group D will be of particular interest to us.
F: Three numeric features, two Integer, one Double.

I have uploaded my datasets to a Google Drive here: https://drive.google.com/drive/folders/1w1axJE9B-XzBOAxLil0sApQdXRQ7G8t5?usp=sharing

The datasets are:

data_real: This is my actual dataset. XGBoost training on this dataset crashes almost every single time.
data_fake: The values of this dataset were generated by independently resampling from the “real” data. For each column, a histogram was created of the possible values that the column takes in the “real” data, then each row of the “fake” data was created by independently sampling from each histogram (the values of the “Y” column were rounded to three decimal places before building the histogram). Thus each column has the same distribution of values as in the “real” data, but all correlation and meaningful signal has been destroyed. XGBoost training on this dataset succeeds almost every time.
data_fake_d: This dataset was generated in the same way as the “fake” data, except ONLY the “D” columns were resampled; the rest were left as they are in the “real” data. That is, only the signal from the “D” columns has been destroyed, leaving the rest intact. XGBoost training on this dataset succeeds around 80% of the time. Analogous datasets were made for A,B,C,E,F, and Y, but they all had close to 100% failure rate, like the “real” data, so I did not include them here.

Also in the drive are:

train.py: The python code that I used for training.
README.txt: A copy of this issue description.

Above is all of the really key information. Below I list various details, in order of decreasing expected relevance:

I’m using H2O version 3.24.0.3. This was the latest version of H2O when I began my experiments, and I have not yet redone them with the most up-to-date version. I have looked at the change logs and it does not appear that my issue has been fixed, though I do not have sufficient expertise to know that for sure.

My Spark version is 2.2.1, though I have tried 2.3 and 2.4 without noticeable difference in outcome.

Experiment design:

My definition of “Failure” is that training takes more than an hour to complete. With my setup, training on any dataset either succeeds in about twenty minutes or stalls out, seemingly unable to train any additional trees. The Spark executors die one at a time, and eventually I have to come in and kill the job manually. Because of the non-deterministic nature of the failure, it is very difficult to conclusively eliminate any variables as being relevant, especially since I am working on a spark cluster that is shared with many other developers in my department. For this reason, I wrote a bash script that runs the training code on a loop with a one-hour timeout, choosing one of the three datasets at random and recording the results (success or timeout). This means that it is possible, though unlikely, that some of the runs marked as failures would have in fact succeeded if given more than one hour. With my normal 32 GB driver, the longest duration of a recorded success was under 30 minutes. With the larger 128 GB driver, there was one recorded success at 52 minutes (on data_fake_d), and the rest were under 35 minutes.

Normal observed failure rates (32GB driver, max_depth=30): data_real: Failed 46/46 times (100.0%) data_fake: Failed 1/45 times (2.2%) data_fake_d: Failed 14/68 times (20.6%)

I repeated these experiments with a much larger H2O driver process (128GB instead of my normal 32 GB). This had an impact on failure rate that is difficult for me to interpret: Large driver observed failure rates (128GB driver, max_depth=30): data_real: Failed 36/40 times (90.0%) data_fake: Failed 0/26 times (0.0%) data_fake_d: Failed 11/30 times (36.7%)

Setting max_depth to 5 instead of 30 causes all three datasets to always succeed, usually in about ten minutes: Shallow trees observed failure rates (32GB driver, max_depth=5): data_real: Failed 0/11 times (0.0%) data_fake: Failed 0/5 times (0.0%) data_fake_d: Failed 0/10 times (0.0%)

All of these experimental results are subject to the caveat that the bug is non-deterministic, so any experimental run has the potential to be a fluke.

Spark Settings

spark.cores.max: 120
spark.executor.cores: 5
spark.default.parallelism: 1024
spark.executor.memory: 12g
spark.memory.fraction: 0.6
spark.memory.storageFraction: 0.5

Cluster Details

30 cores per worker
90GB memory per worker (3GB per core)

None of the features are constructed directly from other features, with the exception of d_0, which is defined as (d_12 OR d_25 OR d_36 OR d_55 OR d_69). This definition is not used and does not hold in the ‘data_fake’ or ‘data_fake_d’ datasets.

Overall data size (number of rows) seems to have a slight impact on failure chance, with larger data more likely to fail, but I have not done sufficient experimentation to make this observation rigorous.

exalate-issue-sync[bot] commented 1 year ago

Nick Lavers commented: I repeated the baseline experiment using the most recent H2O version, 3.24.0.5 (32GB driver, max_depth=30). The results were very similar, indicating that the issue persists: data_real: Failed 40/40 times (100.0%) data_fake: Failed 1/29 times (3.4%) data_fake_d: Failed 10/30 times (33.3%)

exalate-issue-sync[bot] commented 1 year ago

Nick Lavers commented: I again repeated the baseline experiment using the most recent H2O version, 3.26.0.2 (32GB driver, max_depth=30). The results were the same: data_real: Failed 25/25 times (100.0%) data_fake: Failed 0/38 times (0.0%) data_fake_d: Failed 10/36 times (27.8%)

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6618 Assignee: New H2O Bugs Reporter: Nick Lavers State: Open Fix Version: N/A Attachments: N/A Development PRs: N/A

h2oai / h2o-3

Deep-Tree XGBoost with PySparkling Fails Non-Deterministically on Certain Datasets #9013