microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

LightGBMClassifier: multiclass training, workers out of sync and early termination #569

Open sunilsu opened 5 years ago

sunilsu commented 5 years ago

I am training a LightGBM classifier on a dataset with 18 classes, ~2M rows and ~5900 columns. The data is pretty sparse (density ~15%), and the class distribution is unbalanced.

from mmlspark import LightGBMClassifier

lgbm = LightGBMClassifier(objective='multiclass', parallelism='voting_parallel')
model = lgbm.fit(train)

I have a 20-node YARN cluster, each node with 8 cores and 64 GB RAM, so memory is plentiful for this dataset.

Looking at the executor log, I see that one of the LightGBM workers goes out of sync with the rest: while every other worker is on iteration x, this worker is at a higher iteration x + y. Eventually, I get this error:

19/05/16 20:12:04 INFO LightGBMClassifier: LightGBM running iteration: 82 with result: 0 and is finished: false
19/05/16 20:12:04 INFO LightGBMClassifier: LightGBM worker calling LGBM_BoosterUpdateOneIter
19/05/16 20:12:04 INFO LightGBMClassifier: LightGBM running iteration: 82 with result: 0 and is finished: false
19/05/16 20:12:04 INFO LightGBMClassifier: LightGBM worker calling LGBM_BoosterUpdateOneIter
19/05/16 20:12:04 INFO LightGBMClassifier: LightGBM running iteration: 82 with result: 0 and is finished: false
19/05/16 20:12:04 INFO LightGBMClassifier: LightGBM worker calling LGBM_BoosterUpdateOneIter
19/05/16 20:12:04 INFO LightGBMClassifier: LightGBM running iteration: 82 with result: 0 and is finished: false
19/05/16 20:12:04 INFO LightGBMClassifier: LightGBM worker calling LGBM_BoosterUpdateOneIter
19/05/16 20:12:14 INFO LightGBMClassifier: LightGBM running iteration: 99 with result: 0 and is finished: false
[LightGBM] [Fatal] Socket recv error, code: 104
19/05/16 20:12:15 WARN LightGBMClassifier: LightGBM reached early termination on one worker, stopping training on worker. This message should rarely occur

Notice that all workers are on iteration 82 except one, which is on 99. The program continues to run, but I am not sure how many boosting iterations were actually completed. Also, the warning printed toward the end suggests something is not working right.

sunilsu commented 5 years ago

Hi @imatiach-msft, any updates on this?

imatiach-msft commented 5 years ago

@sunilsu another user seems to have encountered a very similar issue here: https://github.com/Azure/mmlspark/issues/609

I was able to reproduce the problem in a test case locally: https://github.com/imatiach-msft/mmlspark/commit/c2568b11ab6e4f74a7f349d84cb8358437eed2b2

I've confirmed the following:

1. If the labels on a partition skip a value, e.g. [0 to j] inclusive and then [j+2 to k], the LightGBM multiclass classifier gets stuck.
2. If partitions have different labels but some have fewer than others, e.g. one has 0 to k and another has 0 to k+1, the LightGBM multiclass classifier finishes.
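To make the distinction concrete, the failing case above is a partition whose distinct labels have a gap (e.g. {0, 1, 3}), while the succeeding case is a contiguous prefix that merely stops early. A minimal, hypothetical helper (plain Python, not part of mmlspark) that detects the gap condition:

```python
def has_label_gap(labels):
    """True if the sorted distinct labels skip a value, e.g. {0, 1, 3}."""
    distinct = sorted(set(labels))
    return distinct != list(range(distinct[0], distinct[-1] + 1))

print(has_label_gap([0, 1, 3, 3, 0]))  # True: label 2 is skipped (the stuck case)
print(has_label_gap([0, 1, 2, 1]))     # False: contiguous labels (training finishes)
```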

My recommendation is to ensure that every partition contains all labels, from 0 up to the number of classes minus one.
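One way to audit this is to collect the label values seen in each partition and report any labels that are missing. The sketch below uses plain Python lists as stand-ins for the label column of each Spark partition (the function name and the data are illustrative, not an mmlspark API; with a real DataFrame you would gather per-partition label sets via the RDD API first):

```python
def missing_labels(partition_labels, num_classes):
    """Return the labels in 0..num_classes-1 that are absent from this partition."""
    return sorted(set(range(num_classes)) - set(partition_labels))

# Toy stand-ins for the label column of two partitions, with 3 classes total.
partitions = [[0, 1, 2, 1], [0, 2, 2]]
for i, labels in enumerate(partitions):
    print(f"partition {i}: missing {missing_labels(labels, 3)}")
```

Any partition that reports a non-empty missing list is a candidate for the hang; a random full repartition of the training data may be enough to spread all labels across partitions.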

I'm still looking into the issue to figure out the best fix.

imatiach-msft commented 5 years ago

Interestingly, they had the same problem where some workers got out of sync.

ocworld commented 4 years ago

I've encountered this error on mmlspark 1.0.0-rc1 with LightGBM 2.3.100.

One difference from this issue is that my objective parameter's value is "binary".