ocworld opened this issue 4 years ago
@ocworld I believe https://github.com/Azure/mmlspark/issues/569 is already fixed, and this is a different issue. I think this might be due either to a faulty network connection or possibly to running out of memory, but I can't be sure; I would have to try to repro the problem. I have worked on saving the model every k iterations to try to mitigate this, but haven't sent out a PR for that functionality yet. I will leave this issue open for now in case we repro it. Do you see this error consistently?
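For illustration, here is what "saving every k iterations" can look like in plain single-machine LightGBM via its Python callback hook. This is only a sketch of the idea, not the unreleased mmlspark functionality; the helper name and file pattern are made up.

import lightgbm as lgb

def save_every_k(k, path_pattern="checkpoint_iter{:04d}.txt"):
    # Returns a LightGBM callback that snapshots the booster every k iterations.
    def _callback(env):
        # env is LightGBM's CallbackEnv; env.iteration is 0-based.
        if (env.iteration + 1) % k == 0:
            env.model.save_model(path_pattern.format(env.iteration + 1))
    return _callback

# usage: lgb.train(params, train_set, callbacks=[save_every_k(50)])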
I'm glad to hear about the new features you are working on. In my experiments, I implemented these functions myself with a hook object acting as a delegate. Once your work on them is done, may I create a pull request?
It is hard to reproduce this issue because of the AWS EMR cost. I have additionally implemented a retry feature in my app; with that in place, I plan to check whether this error occurs consistently.
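For reference, a rough sketch of such a retry wrapper; the helper name, the matched error text, and the backoff values are my own assumptions, not the actual app code.

import time

def fit_with_retry(estimator, train_df, max_retries=3, backoff_s=60):
    # Retry training when LightGBM's network layer fails mid-run
    # (e.g. "Socket recv error, code: 104"); other errors are re-raised.
    for attempt in range(1, max_retries + 1):
        try:
            return estimator.fit(train_df)
        except Exception as e:
            if "Socket recv error" not in str(e) or attempt == max_retries:
                raise
            time.sleep(backoff_s * attempt)  # simple linear backoff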
Hi @ocworld. Regarding "can I create pull request to you": sure, feel free to create a pull request to the mmlspark repository with any new features you might have. Regarding "It is hard to reproduce this issue because of AWS EMR cost": no worries, I will just keep this issue open for now; hopefully I will get a reliable reproduction or run into this myself at some point.
@imatiach-msft I've tried to reproduce the error several times, and it has not happened again since this issue. I assume it is a very rare case.
It's not rare; in fact it is very reproducible (even on a single machine). It is not restricted to classifiers (regressors are affected too), and not even restricted to SynapseML, because lightgbm_ray.RayLGBMRegressor is affected as well. It seems to be unrelated to the model training time alone, but rather tied to:
1) early stopping AND
2) a particular (and useful in practice) combination of the number of estimators and the learning rate; for less complex models it does not occur, only above a certain number of estimators:
learning_rate=0.01
# n_estimators=100  # OK
# n_estimators=200  # OK
# n_estimators=400  # OK
# n_estimators=800  # [LightGBM] [Fatal] Socket recv error, code: 104
n_estimators=1600  # [LightGBM] [Fatal] Socket recv error, code: 104
early_stopping_rounds = 10
It seems to be a LightGBM+Ray or lightgbm_ray issue, because in an identical setup against the same Ray cluster, XGBoost (xgboost_ray.RayXGBRegressor) training works correctly.
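Here is a minimal sketch of a reproduction along the lines described above, assuming lightgbm_ray's scikit-learn-style wrapper and the LightGBM 3.x fit() signature (eval_set plus early_stopping_rounds); the synthetic dataset and the num_actors value are illustrative only.

from sklearn.datasets import make_regression
from lightgbm_ray import RayLGBMRegressor, RayParams

X, y = make_regression(n_samples=100_000, n_features=50, noise=0.1)

model = RayLGBMRegressor(learning_rate=0.01, n_estimators=1600)
model.fit(
    X, y,
    eval_set=[(X, y)],                   # early stopping requires an eval set
    early_stopping_rounds=10,            # the early-stopping trigger described above
    ray_params=RayParams(num_actors=2),  # distributed training actors
)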
Where should I create an issue with reproducible examples, @eisber (to avoid double-posting)?
Describe the bug
I encountered a socket error when training with the binary objective.
[LightGBM] [Fatal] Socket recv error, code: 104
19/11/07 14:54:06 WARN LightGBMClassifier: LightGBM reached early termination on one worker, stopping training on worker. This message should rarely occur
I was training a model for 500 iterations; however, at iteration 318, this error occurred and stopped the training.
It is a similar issue to #569, but the objective parameter value is different: in my experiment, the value is "binary".
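For context, the training setup described above would look roughly like this with mmlspark's LightGBMClassifier; the import path varies across mmlspark versions, and the DataFrame and column names are placeholders.

from mmlspark.lightgbm import LightGBMClassifier

classifier = LightGBMClassifier(
    objective="binary",  # the objective value that differs from #569
    numIterations=500,   # training failed around iteration 318
    labelCol="label",
    featuresCol="features",
)
model = classifier.fit(train_df)  # train_df: a Spark DataFrame with label/features columns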
Additional context
19/11/07 14:41:25 INFO LightGBMClassifier: LightGBM worker calling LGBM_BoosterUpdateOneIter
[LightGBM] [Fatal] Socket recv error, code: 104
19/11/07 14:54:06 WARN LightGBMClassifier: LightGBM reached early termination on one worker, stopping training on worker. This message should rarely occur