microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[python-package] [dask] Dask fit task just crashes #6196

Open dpdrmj opened 7 months ago

dpdrmj commented 7 months ago

Description

Whenever I run this code, the Dask job crashes, all the workers get lost, and then the task just hangs forever. If I provide smaller files (<100MB), the same code works fine. I'm not sure what the issue is. I'm pasting the error below in the "Additional Comments" section.

Reproducible example

# imports needed to make this example self-contained
import dask.dataframe as dd
import lightgbm as lgb
from dask.distributed import Client
from dask_yarn import YarnCluster

cluster = YarnCluster(
    environment="venv:////PATH_TO_PYENV/dask-pyenv",
    worker_vcores=2,
    worker_memory="16GiB",
    n_workers=2,
    worker_env={
        'CLASSPATH': 'PATH_TO_HADOOP/lib/native/libhdfs.so:OTHER_JARS_IN_HADOOP_CLASSPATH',
        'ARROW_LIBHDFS_DIR': 'PATH_TO_HADOOP/lib/native'
    },
    deploy_mode='local',
    name='test_dask'
)

client = Client(cluster.scheduler_address)
dask_model = lgb.DaskLGBMClassifier(
        client=client,
        boosting_type='gbdt',
        objective='binary',
        metric='binary_logloss,auc',
        max_bin=255,
        header=True,
        num_trees=100,
        max_depth=7,
        learning_rate=0.05,
        num_leaves=63,
        tree_learner='data',
        feature_fraction=0.8,
        bagging_freq=5,
        bagging_fraction=0.8,
        min_data_in_leaf=20,
        min_sum_hessian_in_leaf=5.0,
        lambda_l1=3,
        lambda_l2=100,
        cat_l2=20,
        cat_smooth=25,
        is_enable_sparse=True,
        use_two_round_loading=False,
        verbose=2,
        label_column='name:my_label_col',
)

train_ddf = dd.read_csv(
    path_list,
    delimiter=',',
    encoding='utf-8',
    storage_options={'driver': 'libhdfs3'}
).dropna().drop('auction_id', axis=1)

train_features = train_ddf.drop('my_label_col', axis=1)
train_target = train_ddf[['my_label_col']]
dask_model.fit(
    train_features,
    train_target,
    categorical_feature = ['feature1', 'feature2'....'featuren']
)

Environment info

LightGBM version or commit hash: 4.1.0 (see the dependency list below)

all the dependencies:

Package            Version
------------------ --------------
asttokens          2.4.1
bokeh              3.3.0
cffi               1.16.0
click              8.1.7
cloudpickle        3.0.0
comm               0.2.0
contourpy          1.1.1
cryptography       41.0.5
dask               2023.10.0
dask-yarn          0.9+2.g8eed5e2
debugpy            1.8.0
decorator          5.1.1
distributed        2023.10.0
exceptiongroup     1.1.3
executing          2.0.1
fsspec             2023.10.0
grpcio             1.59.0
importlib-metadata 6.8.0
ipython            8.17.2
jedi               0.19.1
Jinja2             3.1.2
joblib             1.3.2
jupyter_client     8.6.0
jupyter_core       5.5.0
lightgbm           4.1.0
locket             1.0.0
lz4                4.3.2
MarkupSafe         2.1.3
matplotlib-inline  0.1.6
msgpack            1.0.7
nest-asyncio       1.5.8
numpy              1.26.1
packaging          23.2
pandas             2.1.1
parso              0.8.3
partd              1.4.1
pexpect            4.8.0
Pillow             10.1.0
pip                22.0.2
platformdirs       4.0.0
prompt-toolkit     3.0.40
protobuf           4.24.4
psutil             5.9.6
ptyprocess         0.7.0
pure-eval          0.2.2
pyarrow            13.0.0
pycparser          2.21
Pygments           2.16.1
python-dateutil    2.8.2
pytz               2023.3.post1
PyYAML             6.0.1
pyzmq              25.1.1
scikit-learn       1.3.2
scipy              1.11.3
setuptools         59.6.0
six                1.16.0
skein              0.8.2
sortedcontainers   2.4.0
stack-data         0.6.3
tblib              3.0.0
threadpoolctl      3.2.0
toolz              0.12.0
tornado            6.3.3
traitlets          5.13.0
tzdata             2023.3
urllib3            2.0.7
wcwidth            0.2.9
xyzservices        2023.10.0
zict               3.0.0
zipp               3.17.0

Command(s) you used to install LightGBM

pip3 install lightgbm

Additional Comments

[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.934990
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.372672
[LightGBM] [Debug] init for col-wise cost 0.708685 seconds, init for row-wise cost 1.673264 seconds
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.912160 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 27836
[LightGBM] [Info] Number of data points in the train set: 4750592, number of used features: 49
[LightGBM] [Debug] Use subset for bagging
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.000239 -> initscore=-8.340142
[LightGBM] [Info] Start training from score -8.340142
[LightGBM] [Debug] Re-bagging, using 3801989 data to train
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fb4d4013756, pid=944460, tid=0x00007fb55da09640
#
# JRE version: OpenJDK Runtime Environment (8.0_382-b05) (build 1.8.0_382-8u382-ga-1~22.04.1-b05)
# Java VM: OpenJDK 64-Bit Server VM (25.382-b05 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [lib_lightgbm.so+0x413756]  LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xe16
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# PATH_TO_APP_CACHE/appcache/application_1697132938548_0291/container_1697132938548_0291_01_000003/hs_err_pid944460.log

I originally reported this on the dask/distributed GitHub (https://github.com/dask/distributed/issues/8341), but was asked to report it to LightGBM instead.

jameslamb commented 7 months ago

Thanks for the report.

One clarifying question.... are you using Jython? I'm very surprised to see "A fatal error has been detected by the Java Runtime Environment" in the logs of a Python process.

If not Jython... can you help me understand where those JVM logs are coming from?

dpdrmj commented 7 months ago

I'm using Python. Maybe it's because I'm reading files from HDFS and using YARN as the resource manager? (It is a YARN cluster.)

jameslamb commented 7 months ago

reading files from HDFS

ahhhh ok, I missed storage_options={'driver': 'libhdfs3'} in the code since it was so far over. Got it!

I've reformatted your example to make it a bit easier to read. It seems like a very important detail (the fact that you're using dask-yarn) was excluded because you put the entire first line defining the cluster on the same line as the opening backticks for the code block.

(screenshot of the reformatted issue description)

If you're unsure how I got the example to look the way it does, please review https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.

jameslamb commented 7 months ago

If I provide smaller files (<100MB), the same code works fine.

I strongly suspect that the issue is that you're losing Dask workers due to out-of-memory problems.

I see you're using a 2-worker dask-yarn cluster, with 16GiB per worker (32GiB total).

The logs from the failed run show that you're training on raw data with 4,750,592 rows ("Number of data points in the train set") and 49 used features.

Even if every value were an int32, the absolute minimum it'd take to represent that raw data in memory would be about 0.86GiB.

Add to that the other copies of the data that exist during training (the Dask DataFrame partitions held on each worker plus the LightGBM Dataset constructed from them), and it seems very likely that a worker could be lost due to out-of-memory issues.
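
A rough back-of-the-envelope sketch of that estimate, using the row and feature counts from the logs above and assuming 4 bytes per value:

n_rows = 4_750_592           # "Number of data points in the train set" from the logs
n_features = 49              # "number of used features" from the logs
bytes_per_value = 4          # int32 / float32
raw_bytes = n_rows * n_features * bytes_per_value
print(raw_bytes / 2**30)     # roughly 0.87 GiB for one dense copy of the raw data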

Dask can't see into memory allocations made in LightGBM's C++ code, so it can't prevent some forms of memory issues by spilling to disk the way it can with other Python objects. As described in #3775 (a not-yet-implemented feature request), LightGBM distributed training cannot currently handle the case where a worker process dies during training.

I know this isn't the answer you were hoping for, but I think the best path forward is some mix of giving training more memory and reducing the amount of memory used, as sketched below.
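
For example, an illustrative sketch of a few memory-reducing adjustments (reusing train_ddf, client, and lgb from the example above; the column handling, partition size, and max_bin value are assumptions, not tuned recommendations):

# downcast numeric feature columns to float32 to halve their in-memory size
# (assumes all non-label columns are numeric; adjust for your actual schema)
numeric_cols = [c for c in train_ddf.columns if c != 'my_label_col']
train_ddf = train_ddf.astype({c: 'float32' for c in numeric_cols})

# use fewer, larger partitions so each worker holds fewer concurrent task copies
train_ddf = train_ddf.repartition(partition_size='256MB')

# a smaller max_bin shrinks the LightGBM Dataset constructed on each worker
dask_model = lgb.DaskLGBMClassifier(
    client=client,
    objective='binary',
    max_bin=63,
)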

If you're knowledgeable about Dask, we'd welcome any suggestions to improve memory usage in lightgbm.dask. Otherwise, subscribe to #3775 to be notified if/when someone tries to implement that feature.

dpdrmj commented 7 months ago

thanks for the quick response @jameslamb. Earlier, I did not think it was an out-of-memory issue; I tried with 64GB per worker as well and still saw the same problem. Also, while checking the memory usage on the Dask dashboard, I saw that workers were not going above 4-6GB of memory usage before they got lost. But since you are saying that "Dask can't see into memory allocations made in LightGBM's C++ code", perhaps the memory usage shown on the Dask dashboard is not correct? In that case I can try increasing the memory.

dpdrmj commented 7 months ago

The total size of the data is just 2GB, so I think it can't be an out-of-memory issue.

jameslamb commented 7 months ago

The total size of the data is just 2GB, so I think it can't be an out-of-memory issue

It depends on what you mean by "the data". I listed out some examples of that in my comments above.

dpdrmj commented 7 months ago

I meant raw data. I will try with more memory, but if 2GB of raw data is going to take that much memory, then I should probably try different options. I'll try some of your suggestions above. By the way, I tried SynapseML earlier and faced some issues there as well, but unfortunately that community isn't as responsive. Anyway, I really appreciate the quick responses and all the suggestions! Thanks a lot!

jameslamb commented 7 months ago

Besides the other approaches I provided for reducing memory usage, you could also try using Dask Array instead of Dask DataFrame.

Since you're just loading raw data from files and passing it directly into LightGBM for training, it doesn't appear that you really need any of the functionality of Dask DataFrame. I'd expect Dask Array (made up of underlying numpy arrays, for example) to introduce less memory overhead than Dask DataFrame.
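
A minimal sketch of that conversion, reusing the train_features / train_target collections from the original example (note that with arrays, categorical_feature would need to be passed as column indices rather than names):

# convert the Dask DataFrame / Series to Dask Arrays; lengths=True computes
# chunk sizes up front so downstream code does not see unknown chunk shapes
X = train_features.to_dask_array(lengths=True)
y = train_target['my_label_col'].to_dask_array(lengths=True)

dask_model.fit(X, y)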

jameslamb commented 7 months ago

Other things to consider:

jmoralez commented 7 months ago

Just want to add that the error caught by the JRE is a SIGSEGV (segfault), so there could be some weird interaction going on as well.

dpdrmj commented 7 months ago

Sure, let me try with Dask Arrays. Machines aren't removed during training. These machines are Hadoop datanodes, but they are not consuming that much memory.

dragno99 commented 3 months ago

@dpdrmj hey, any luck with this? I am facing much the same issue. I am using YARN; Dask is able to load the datasets and preprocess them using dask_ml.preprocessing.Categorizer(), but whenever I call dask_model.fit, everything just hangs, including the Dask dashboard, scheduler, and workers. I can also see errors like: 2024-03-24 04:36:40,227 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.49s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

dragno99 commented 3 months ago

@dpdrmj @jameslamb I have observed a few things (and I am now able to train on a large dataset):

1) Don't choose deploy mode 'local', because that puts the scheduler on the local machine, where it creates many tasks and consumes memory and cores (this can be the issue). Choose deploy_mode='remote' when creating the YARN cluster, and give the scheduler ample memory (10 GB or 20 GB, depending on the number of partitions). See the sketch below.
2) Keep the number of partitions low, because too many partitions create too many tasks, which require more scheduler memory. (I kept 20-30 partitions from the start, with 300 million rows in total; these tasks also hold the GIL.)
3) There seems to be some issue in the latest version of LightGBM (version 4.*); for me, version 3.3.5 worked fine.
4) Keep the client's and the workers' environments the same.
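
A sketch of what points 1) and 2) might look like with dask-yarn's YarnCluster options (the memory sizes and partition count are illustrative, not recommendations):

from dask_yarn import YarnCluster
from dask.distributed import Client

cluster = YarnCluster(
    environment="venv:////PATH_TO_PYENV/dask-pyenv",
    deploy_mode='remote',        # run the scheduler in its own YARN container
    scheduler_memory='16GiB',    # give the scheduler headroom for the task graph
    worker_vcores=2,
    worker_memory='16GiB',
    n_workers=2,
)
client = Client(cluster)

# train_ddf: the Dask DataFrame built as in the original example;
# keep the number of partitions small so the scheduler tracks fewer tasks
train_ddf = train_ddf.repartition(npartitions=30)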

I am sharing stdout logs of both runs

when using latest lightgbm version

[LightGBM] [Warning] Connecting to rank 2 failed, waiting for 963 milliseconds
[LightGBM] [Warning] Connecting to rank 2 failed, waiting for 1251 milliseconds
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Connected to rank 2
[LightGBM] [Info] Local rank: 1, total number of machines: 3
[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=60
[LightGBM] [Warning] bagging_fraction is set=1, subsample=1.0 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Warning] lambda_l1 is set=0.225, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.225
[LightGBM] [Warning] lambda_l2 is set=4.512, reg_lambda=0.0 will be ignored. Current value: lambda_l2=4.512
[LightGBM] [Warning] bagging_freq is set=7, subsample_freq=0 will be ignored. Current value: bagging_freq=7
[LightGBM] [Warning] Categorical features with more bins than the configured maximum bin number found.
[LightGBM] [Warning] For categorical features, max_bin and max_bin_by_feature may be ignored with a large number of categories.
[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=60
[LightGBM] [Warning] bagging_fraction is set=1, subsample=1.0 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Warning] lambda_l1 is set=0.225, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.225
[LightGBM] [Warning] lambda_l2 is set=4.512, reg_lambda=0.0 will be ignored. Current value: lambda_l2=4.512
[LightGBM] [Warning] bagging_freq is set=7, subsample_freq=0 will be ignored. Current value: bagging_freq=7
[LightGBM] [Info] Number of positive: 3357600, number of negative: 284091693
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.125302
[LightGBM] [Info] Total Bins 108792
[LightGBM] [Info] Number of data points in the train set: 97016363, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.025230 -> initscore=-3.654179
[LightGBM] [Info] Start training from score -3.654179
2024-03-27 22:13:21,398 - distributed.nanny - INFO - Worker process 3740454 was killed by signal 11
2024-03-27 22:13:21,860 - distributed.nanny - WARNING - Restarting worker
2024-03-27 22:13:22,468 - distributed.worker - INFO -       Start worker at:   tcp://***************
2024-03-27 22:13:22,468 - distributed.worker - INFO -          Listening to:   tcp://***************
2024-03-27 22:13:22,468 - distributed.worker - INFO -           Worker name:              dask.worker_0
2024-03-27 22:13:22,468 - distributed.worker - INFO -          dashboard at:         (tcp://***************)
2024-03-27 22:13:22,468 - distributed.worker - INFO - Waiting to connect to:   tcp://***************
2024-03-27 22:13:22,468 - distributed.worker - INFO - -------------------------------------------------
2024-03-27 22:13:22,468 - distributed.worker - INFO -               Threads:                         15
2024-03-27 22:13:22,468 - distributed.worker - INFO -                Memory:                 139.70 GiB
2024-03-27 22:13:22,468 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-wvgizkts
2024-03-27 22:13:22,468 - distributed.worker - INFO - -------------------------------------------------
2024-03-27 22:13:22,845 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-03-27 22:13:22,845 - distributed.worker - INFO -         Registered to:   tcp://***************
2024-03-27 22:13:22,846 - distributed.worker - INFO - -------------------------------------------------
2024-03-27 22:13:22,846 - distributed.core - INFO - Starting established connection to tcp://***************

and when using lightgbm 3.3.5

/hadoop/yarn/nm/usercache/nobody/appcache/application_1711607987999_0010/container_e114_1711607987999_0010_01_000006/.mamba/envs/my-custom-mamba-environment/lib/python3.10/site-packages/lightgbm/basic.py:2065: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
[LightGBM] [Info] Trying to bind port 59513...
[LightGBM] [Info] Binding port 59513 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 2
[LightGBM] [Info] Local rank: 3, total number of machines: 4
[LightGBM] [Warning] bagging_freq is set=7, subsample_freq=0 will be ignored. Current value: bagging_freq=7
[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=60
[LightGBM] [Warning] lambda_l2 is set=4.512, reg_lambda=0.0 will be ignored. Current value: lambda_l2=4.512
[LightGBM] [Warning] num_threads is set=16, n_jobs=-1 will be ignored. Current value: num_threads=16
[LightGBM] [Warning] lambda_l1 is set=0.225, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.225
[LightGBM] [Warning] bagging_fraction is set=1, subsample=1.0 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Warning] bagging_freq is set=7, subsample_freq=0 will be ignored. Current value: bagging_freq=7
[LightGBM] [Warning] min_data_in_leaf is set=60, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=60
[LightGBM] [Warning] lambda_l2 is set=4.512, reg_lambda=0.0 will be ignored. Current value: lambda_l2=4.512
[LightGBM] [Warning] num_threads is set=16, n_jobs=-1 will be ignored. Current value: num_threads=16
[LightGBM] [Warning] lambda_l1 is set=0.225, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.225
[LightGBM] [Warning] bagging_fraction is set=1, subsample=1.0 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Info] Number of positive: 3357600, number of negative: 284091693
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.142935
[LightGBM] [Info] Total Bins 100128
[LightGBM] [Info] Number of data points in the train set: 49089345, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.025230 -> initscore=-3.654179
[LightGBM] [Info] Start training from score -3.654179
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and depth = 17

All of the environment and configuration are the same except for the LightGBM version.