microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

[LightGBM] [Fatal] Socket recv error, code: 104 when training with binary objective. #728

Open ocworld opened 4 years ago

ocworld commented 4 years ago

Describe the bug

I encountered a socket error when training with the binary objective.

[LightGBM] [Fatal] Socket recv error, code: 104
19/11/07 14:54:06 WARN LightGBMClassifier: LightGBM reached early termination on one worker, stopping training on worker. This message should rarely occur

I was training a model for 500 iterations; however, at iteration 318 this error occurred and stopped the training.

This is a similar issue to #569, but the objective parameter value is different: in my experiment the value is "binary".

Additional context

19/11/07 14:41:25 INFO LightGBMClassifier: LightGBM worker calling LGBM_BoosterUpdateOneIter
[LightGBM] [Fatal] Socket recv error, code: 104
19/11/07 14:54:06 WARN LightGBMClassifier: LightGBM reached early termination on one worker, stopping training on worker. This message should rarely occur
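
(No reproduction snippet was posted, so purely as an illustration, the setup described above might look roughly like the sketch below in SynapseML. The column names, the training DataFrame, and the import path are assumptions, not the reporter's actual code; older mmlspark releases used a different import path.)

from synapse.ml.lightgbm import LightGBMClassifier

classifier = (
    LightGBMClassifier()
    .setObjective("binary")      # objective reported in this issue
    .setNumIterations(500)       # the error reportedly occurred around iteration 318
    .setLabelCol("label")        # assumed column name
    .setFeaturesCol("features")  # assumed pre-assembled feature vector column
)

model = classifier.fit(train_df)  # train_df: assumed Spark DataFrame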

imatiach-msft commented 4 years ago

@ocworld I believe https://github.com/Azure/mmlspark/issues/569 has already been fixed, and this is a different issue. I think this might be due either to a faulty network connection or possibly to running out of memory, but I can't be sure; I would have to try to repro the problem. I have been working on adding saving every k iterations to try to mitigate this problem, but haven't sent out a PR for this functionality yet. I will leave this issue open for now in case we repro it. Do you see this error consistently?
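
(For illustration of the "save every k iterations" idea: outside of SynapseML, the concept can be sketched with a plain LightGBM callback in Python. This is not the unpublished SynapseML work mentioned above, and the checkpoint path and interval are made up.)

import lightgbm as lgb

def checkpoint_every(k, path_template="checkpoint_iter_{}.txt"):
    # Return a LightGBM callback that saves the booster after every k-th boosting round.
    def _callback(env):
        if (env.iteration + 1) % k == 0:
            env.model.save_model(path_template.format(env.iteration + 1))
    return _callback

# Usage sketch:
# booster = lgb.train(params, train_set, num_boost_round=500,
#                     callbacks=[checkpoint_every(50)])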

ocworld commented 4 years ago

I'm glad to hear about the new features you are working on. In my experiment, I implemented these functions myself with a hook object as a delegate. Once your work on them is done, may I create a pull request to you?

It is hard to reproduce this issue because of AWS EMR costs. I have also implemented a retry feature in my app; with that in place, I plan to check whether this error occurs consistently.
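
(The retry feature could be as simple as wrapping the fit call; the sketch below is a generic illustration, not the actual implementation mentioned above.)

import time

def fit_with_retry(estimator, train_df, max_attempts=3, wait_seconds=60):
    # Retry training a Spark ML estimator a few times before giving up.
    for attempt in range(1, max_attempts + 1):
        try:
            return estimator.fit(train_df)
        except Exception as err:  # e.g. the worker-side socket failure surfacing as a training error
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({err}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)

# model = fit_with_retry(classifier, train_df)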

imatiach-msft commented 4 years ago

hi @ocworld,
"can I create pull request to you": sure, feel free to create a pull request to the mmlspark repository with any new features you might have.
"It is hard to reproduce this issue because of AWS EMR cost": no worries, I will just keep this issue open for now; hopefully I will get a reliable reproduction or run into this myself at some point.

ocworld commented 4 years ago

@imatiach-msft I've tried to reproduce the error several times since filing this issue, but it has not happened again. I assume it is a very rare case.

mirekphd commented 2 years ago

It's not rare; in fact it is very reproducible (even on a single machine). It is not restricted to classifiers (regressors are affected too), and not even restricted to SynapseML, because lightgbm_ray.RayLGBMRegressor is affected as well. It seems to be unrelated to the model training time alone, but rather to: 1) early stopping, AND 2) a particular (and useful in practice) combination of the number of estimators and the learning rate (for less complex models it does not occur, only above a certain number of estimators):

learning_rate=0.01

# n_estimators=100 # OK
# n_estimators=200 # OK
# n_estimators=400 # OK
# n_estimators=800 # [LightGBM] [Fatal] Socket recv error, code: 104
n_estimators=1600 # [LightGBM] [Fatal] Socket recv error, code: 104

early_stopping_rounds = 10
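
(In SynapseML terms, the failing configuration above corresponds roughly to the sketch below; the column names, the validation-split wiring, and the DataFrame are assumptions, not the actual code used.)

from synapse.ml.lightgbm import LightGBMRegressor

regressor = (
    LightGBMRegressor()
    .setLearningRate(0.01)
    .setNumIterations(1600)                      # reportedly fails; 100-400 train fine
    .setEarlyStoppingRound(10)
    .setValidationIndicatorCol("is_validation")  # early stopping needs a validation split
    .setLabelCol("label")
    .setFeaturesCol("features")
)

model = regressor.fit(train_df)  # train_df: assumed Spark DataFrame with the columns above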

mirekphd commented 2 years ago

It seems to be a LightGBM+Ray or lightgbm_ray issue, because in an identical setup against the same Ray cluster XGBoost (xgboost_ray.RayXGBRegressor) training works correctly.
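
(The A/B comparison described above looks roughly like the sketch below; the data, the number of Ray actors, and the exact fit arguments are assumptions.)

from lightgbm_ray import RayLGBMRegressor, RayParams
from xgboost_ray import RayXGBRegressor

ray_params = RayParams(num_actors=2)  # assumed cluster size

# Fails with "Socket recv error, code: 104" in the setup described above
lgbm = RayLGBMRegressor(n_estimators=1600, learning_rate=0.01)
lgbm.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
         early_stopping_rounds=10, ray_params=ray_params)

# Reportedly trains fine against the same Ray cluster
xgb = RayXGBRegressor(n_estimators=1600, learning_rate=0.01)
xgb.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
        early_stopping_rounds=10, ray_params=ray_params)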

Where should I create an issue with reproducible examples @eisber (to avoid double-posting):