Open dpdrmj opened 7 months ago
Thanks for the report.
One clarifying question: are you using Jython? I'm very surprised to see "A fatal error has been detected by the Java Runtime Environment" in the logs of a Python process.
If not Jython... can you help me understand where those JVM logs are coming from?
I'm using python. Maybe because I'm reading files from HDFS and using Yarn as the resource manager? (it is a Yarn cluster)
reading files from HDFS
ahhhh ok, I missed storage_options={'driver': 'libhdfs3'} in the code since it was so far over. Got it!
I've reformatted your example to make it a bit easier to read. It seems like a very important detail (the fact that you're using dask-yarn) was excluded because you put the entire first line defining the cluster on the same line as the opening backticks for the code block.
If you're unsure how I got the example to look the way it does, please review https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.
if I provide small size files then the same code works fine. (<100MB)
I strongly suspect that you're losing Dask workers due to out-of-memory issues.
I see you're using a 2-worker dask-yarn cluster, with 16GiB per worker (32GiB total).
The logs from the failed run show that you're using raw data with the following characteristics:
Even if they were all int32, the absolute minimum it'd take to represent that raw data in memory would be about 0.86GiB.
Add to that:
- the pandas + Dask DataFrame representation
- a Dataset with 27,836 bins
- a Booster with up to 6300 leaves (num_leaves=100 * max_leaves=63)
- copies of the Booster and Dataset held throughout training
It seems very likely that a worker could be lost due to out-of-memory issues.
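As a back-of-the-envelope sketch of that arithmetic (the value count below is a placeholder chosen to land near the 0.86GiB figure, not the actual shape from this issue's logs):

```python
def min_in_memory_gib(n_values: int, itemsize_bytes: int = 4) -> float:
    """Lower bound (GiB) for holding raw values as a dense array.

    itemsize_bytes=4 corresponds to int32/float32. Real usage is higher
    once the pandas/Dask representation, the LightGBM Dataset, and the
    Booster are added on top.
    """
    return n_values * itemsize_bytes / 2**30

# ~231M int32 values works out to roughly the 0.86GiB floor mentioned above.
print(round(min_in_memory_gib(231_000_000), 2))  # -> 0.86
```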
Dask can't see into memory allocations made in LightGBM's C++ code, so it can't prevent some forms of memory issues by spilling to disk the way it can with other Python objects. As described in #3775 (a not-yet-implemented feature request), LightGBM distributed training cannot currently handle the case where a worker process dies during training.
I know this isn't the answer you were hoping for, but I think the best path forward is some mix of giving training more memory and reducing the amount of memory used. For example, try some combination of the following:
- reducing max_bins
- reducing num_leaves
If you're knowledgeable about Dask, we'd welcome any suggestions to improve memory usage in lightgbm.dask. Otherwise, subscribe to #3775 to be notified if/when someone tries to implement that feature.
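As a concrete sketch of those reductions (the parameter values here are illustrative, not tuned recommendations; `max_bin` is the canonical LightGBM spelling, with `max_bins` as an alias):

```python
# Illustrative lower-memory settings; tune for your own accuracy/memory trade-off.
low_memory_params = {
    "max_bin": 63,     # LightGBM default is 255; fewer bins -> smaller Dataset
    "num_leaves": 31,  # fewer leaves -> smaller trees in the Booster
}

# These kwargs would then be passed to the Dask estimator, e.g.:
# from lightgbm import DaskLGBMClassifier
# model = DaskLGBMClassifier(**low_memory_params)
# model.fit(X, y)  # X, y: Dask collections
```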
thanks for the quick response @jameslamb. Earlier, I did not think it was an out-of-memory issue; I tried with 64GB per worker as well and still saw the same problems. Also, while checking memory usage on the dask dashboard, I saw that workers were not going above 4-6GB of memory usage before getting lost. But since you are saying that "Dask can't see into memory allocations made in LightGBM's C++ code", perhaps the memory usage shown on the dask dashboard is not correct? In that case I can try increasing the memory.
total size of data is just 2GB so I think it can't be an out-of-memory issue.
total size of data is just 2GB so I think it can't be out-of-memory issue
It depends on what you mean by "the data". I listed out some examples of that in my comments above.
i meant raw data. I will try with more memory, but if 2GB of raw data is going to take that much memory then I should probably try different options. I'll try some of your suggestions above. Btw, I tried SynapseML earlier and faced some issues there as well, but unfortunately that community isn't as responsive. Anyway, really appreciate the quick responses and all the suggestions! Thanks a lot!
Besides the other approaches I provided for reducing memory usage, you could also try using Dask Array instead of Dask DataFrame.
Since it appears you're just loading raw data from files and directly passing it into LightGBM for training, it doesn't appear that you really need any of the functionality of Dask DataFrame. I'd expect Dask Array (made up of underlying numpy arrays, for example) to introduce less memory overhead than Dask DataFrame.
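A rough single-chunk illustration of that overhead difference, using plain numpy and pandas as stand-ins for the per-partition objects Dask manages (the shapes here are arbitrary placeholders):

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 100_000, 14

# Dask Array chunks are plain numpy arrays: one contiguous float32 buffer.
arr = np.zeros((n_rows, n_cols), dtype=np.float32)

# Dask DataFrame partitions are pandas DataFrames, which add an index and
# block-manager metadata on top of the values in every partition.
df = pd.DataFrame(arr, columns=[f"f{i}" for i in range(n_cols)])

array_bytes = arr.nbytes
frame_bytes = int(df.memory_usage(index=True, deep=True).sum())
print(array_bytes <= frame_bytes)
```

On the Dask side, `DataFrame.to_dask_array(lengths=True)` converts an existing Dask DataFrame into a Dask Array without reloading the data.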
Other things to consider:
Just want to add that the error caught by the JRE is a SIGSEGV (segfault), so there could be some weird interaction going on as well.
sure, let me try with Dask Arrays. Machines aren't removed during training. These machines are hadoop datanodes but they are not consuming that much memory.
@dpdrmj hey, any luck with this? I am also facing a similar issue. I am using yarn; dask is able to load the datasets and preprocess them using dask_ml.preprocessing.Categorizer(), but whenever I call dask_model.fit, everything just hangs, including the dask dashboard, scheduler, and workers. I can also see errors like: 2024-03-24 04:36:40,227 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.49s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
@dpdrmj @jameslamb i have observed a few things (and now I am able to train on a large dataset):
1. Don't choose deploy mode as local, because this puts the scheduler on the local machine, which creates many tasks and consumes memory and cores (this can be the issue). Choose deploy_mode as remote when creating the yarn cluster, and also give it an ample amount of memory (10 GB or 20 GB, depending on the number of partitions).
2. Keep the number of partitions low, because too many partitions create too many tasks, which require more scheduler memory. (I tried to keep partitions at 20-30 from the start, with 300 million rows total; these tasks also lock the GIL.)
3. There seems to be some issue in the latest version of lightgbm (version 4.*); for me, version 3.3.5 worked fine.
4. Keep the client's and workers' environments the same.
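A minimal sketch of cluster settings reflecting points 1 and 2 above (all values are hypothetical placeholders; `deploy_mode`, `scheduler_memory`, and `worker_memory` are real dask-yarn `YarnCluster` keyword arguments):

```python
# Hypothetical dask-yarn settings following the tips above; adjust for your cluster.
cluster_kwargs = {
    "deploy_mode": "remote",      # scheduler runs in its own YARN container, not locally
    "scheduler_memory": "20GiB",  # ample scheduler memory for many tasks
    "worker_memory": "16GiB",
    "n_workers": 2,
}

# The kwargs would be passed like this (requires dask-yarn on the cluster):
# from dask_yarn import YarnCluster
# cluster = YarnCluster(environment="environment.tar.gz", **cluster_kwargs)

# Fewer partitions -> fewer tasks for the scheduler to track, e.g.:
# ddf = ddf.repartition(npartitions=30)
```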
I am sharing stdout logs of both runs
when using latest lightgbm version
[LightGBM] [Warning] Connecting to rank 2 failed, waiting for 963 milliseconds
[LightGBM] [Warning] Connecting to rank 2 failed, waiting for 1251 milliseconds
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Connected to rank 2
[LightGBM] [Info] Local rank: 1, total number of machines: 3
[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=60
[LightGBM] [Warning] bagging_fraction is set=1, subsample=1.0 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Warning] lambda_l1 is set=0.225, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.225
[LightGBM] [Warning] lambda_l2 is set=4.512, reg_lambda=0.0 will be ignored. Current value: lambda_l2=4.512
[LightGBM] [Warning] bagging_freq is set=7, subsample_freq=0 will be ignored. Current value: bagging_freq=7
[LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
[LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=60
[LightGBM] [Warning] bagging_fraction is set=1, subsample=1.0 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Warning] lambda_l1 is set=0.225, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.225
[LightGBM] [Warning] lambda_l2 is set=4.512, reg_lambda=0.0 will be ignored. Current value: lambda_l2=4.512
[LightGBM] [Warning] bagging_freq is set=7, subsample_freq=0 will be ignored. Current value: bagging_freq=7
[LightGBM] [Info] Number of positive: 3357600, number of negative: 284091693
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.125302
[LightGBM] [Info] Total Bins 108792
[LightGBM] [Info] Number of data points in the train set: 97016363, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.025230 -> initscore=-3.654179
[LightGBM] [Info] Start training from score -3.654179
2024-03-27 22:13:21,398 - distributed.nanny - INFO - Worker process 3740454 was killed by signal 11
2024-03-27 22:13:21,860 - distributed.nanny - WARNING - Restarting worker
2024-03-27 22:13:22,468 - distributed.worker - INFO - Start worker at: tcp://***************
2024-03-27 22:13:22,468 - distributed.worker - INFO - Listening to: tcp://***************
2024-03-27 22:13:22,468 - distributed.worker - INFO - Worker name: dask.worker_0
2024-03-27 22:13:22,468 - distributed.worker - INFO - dashboard at: (tcp://***************)
2024-03-27 22:13:22,468 - distributed.worker - INFO - Waiting to connect to: tcp://***************
2024-03-27 22:13:22,468 - distributed.worker - INFO - -------------------------------------------------
2024-03-27 22:13:22,468 - distributed.worker - INFO - Threads: 15
2024-03-27 22:13:22,468 - distributed.worker - INFO - Memory: 139.70 GiB
2024-03-27 22:13:22,468 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-wvgizkts
2024-03-27 22:13:22,468 - distributed.worker - INFO - -------------------------------------------------
2024-03-27 22:13:22,845 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-03-27 22:13:22,845 - distributed.worker - INFO - Registered to: tcp://***************
2024-03-27 22:13:22,846 - distributed.worker - INFO - -------------------------------------------------
2024-03-27 22:13:22,846 - distributed.core - INFO - Starting established connection to tcp://***************
and when using lightgbm 3.3.5
/hadoop/yarn/nm/usercache/nobody/appcache/application_1711607987999_0010/container_e114_1711607987999_0010_01_000006/.mamba/envs/my-custom-mamba-environment/lib/python3.10/site-packages/lightgbm/basic.py:2065: UserWarning: Using categorical_feature in Dataset.
_log_warning('Using categorical_feature in Dataset.')
[LightGBM] [Info] Trying to bind port 59513...
[LightGBM] [Info] Binding port 59513 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 2
[LightGBM] [Info] Local rank: 3, total number of machines: 4
[LightGBM] [Warning] bagging_freq is set=7, subsample_freq=0 will be ignored. Current value: bagging_freq=7
[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=60
[LightGBM] [Warning] lambda_l2 is set=4.512, reg_lambda=0.0 will be ignored. Current value: lambda_l2=4.512
[LightGBM] [Warning] num_threads is set=16, n_jobs=-1 will be ignored. Current value: num_threads=16
[LightGBM] [Warning] lambda_l1 is set=0.225, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.225
[LightGBM] [Warning] bagging_fraction is set=1, subsample=1.0 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Warning] bagging_freq is set=7, subsample_freq=0 will be ignored. Current value: bagging_freq=7
[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=60
[LightGBM] [Warning] lambda_l2 is set=4.512, reg_lambda=0.0 will be ignored. Current value: lambda_l2=4.512
[LightGBM] [Warning] num_threads is set=16, n_jobs=-1 will be ignored. Current value: num_threads=16
[LightGBM] [Warning] lambda_l1 is set=0.225, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.225
[LightGBM] [Warning] bagging_fraction is set=1, subsample=1.0 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Info] Number of positive: 3357600, number of negative: 284091693
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.142935
[LightGBM] [Info] Total Bins 100128
[LightGBM] [Info] Number of data points in the train set: 49089345, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.025230 -> initscore=-3.654179
[LightGBM] [Info] Start training from score -3.654179
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
all of the environment and configurations are the same except lightgbm.
Description
Whenever I run this code, the dask job crashes, all the workers get lost, and then the task just hangs forever, while if I provide small files (<100MB) the same code works fine. I'm not sure what the issue is. Pasting the error below in the "Additional Comments" section.
Reproducible example
Environment info
LightGBM version or commit hash:
all the dependencies:
Command(s) you used to install LightGBM
Additional Comments
I had reported this on dask-distributed github (https://github.com/dask/distributed/issues/8341) but someone asked me to report to lightgbm.