microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

LightGBM stuck at "reduce at LightGBMClassifier.scala:150" #1053

Open · OldDreamHunter opened this issue 3 years ago

OldDreamHunter commented 3 years ago

I have already seen issue https://github.com/Azure/mmlspark/issues/542, but the answer there does not solve my problem.

I have a dataset of nearly 72 GB with 145 columns. My Spark config is:

```
spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-memory 15g \
  --driver-memory 15g \
  --executor-cores 1 \
  --num-executors 20 \
  --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 \
  --conf spark.default.parallelism=5000 \
  --conf spark.sql.shuffle.partitions=5000 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.memory.storageFraction=0.3 \
  --conf spark.executor.memoryOverhead=15g \
  --conf spark.driver.maxResultSize=10g \
```

If I reduce the dataset to 24 GB, I can train the model in about 40 minutes. But when I increase it to 72 GB, training gets stuck at "reduce at LightGBMClassifier.scala:150" and reports several failures:

- "ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 128370 ms"
- "java.lang.Exception: Dataset create call failed in LightGBM with error: Socket recv error, code: 104"
- "java.net.ConnectException: Connection refused"

AB#1188553

welcome[bot] commented 3 years ago

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

imatiach-msft commented 3 years ago

hi @OldDreamHunter, sorry about the trouble you are having. Have you tried increasing the socket timeout? https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMParams.scala#L47 What parameters are you passing to LightGBM?
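For concreteness, a minimal sketch of raising the timeout from Python, assuming the mmlspark_2.11:1.0.0-rc1 package used above and that `timeout` is in seconds (the default in LightGBMParams.scala is 1200); the value 7200.0 here is only an example:

```python
# Hedged sketch: raise the LightGBM network socket timeout.
# Assumes mmlspark 1.0.0-rc1, where `timeout` is in seconds (default 1200).
from mmlspark.lightgbm import LightGBMClassifier

lgb = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    timeout=7200.0,  # example value: give slow Dataset-create phases more headroom
)
```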

OldDreamHunter commented 3 years ago

> hi @OldDreamHunter, sorry about the trouble you are having. Have you tried increasing the socket timeout? https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMParams.scala#L47 What parameters are you passing to LightGBM?

Thanks for your reply @imatiach-msft. I haven't increased the socket timeout yet but will try it. The parameters of my model are described below.

```python
lgb = LightGBMClassifier(
    objective="binary",
    boostingType='gbdt',
    isUnbalance=True,
    featuresCol='features',
    labelCol='label',
    maxBin=64,
    earlyStoppingRound=100,
    learningRate=0.5,
    maxDepth=6,
    numLeaves=48,
    lambdaL1=0.8,
    lambdaL2=45.0,
    baggingFraction=0.7,
    featureFraction=0.7,
    numIterations=200)
```

OldDreamHunter commented 3 years ago

> hi @OldDreamHunter, sorry about the trouble you are having. Have you tried increasing the socket timeout? https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMParams.scala#L47 What parameters are you passing to LightGBM?

hi @imatiach-msft, I have increased the timeout and changed the parallelism type to "voting_parallel", but the job still fails at "reduce at LightGBMBase.scala:230" with the failure reason "Job aborted due to stage failure: Task 8 in stage 4.0 failed 4 times, most recent failure: Lost task 8.3 in stage 4.0 (TID 6027, pro-dchadoop-195-81, executor 22): java.net.ConnectException: Connection refused (Connection refused)". My updated parameters:

```python
boostingType='gbdt',
isUnbalance=True,
featuresCol='features',
labelCol='label',
maxBin=64,
earlyStoppingRound=100,
learningRate=0.5,
maxDepth=5,
numLeaves=32,
lambdaL1=7.0,
lambdaL2=7.0,
baggingFraction=0.7,
featureFraction=0.7,
numIterations=200,
parallelism='voting_parallel',
timeout=120000.0)
```

imatiach-msft commented 3 years ago

@OldDreamHunter I think that is a red herring; the real error is on one of the other nodes. Can you send all of the unique task error messages? Please ignore the connection refused error.

imatiach-msft commented 3 years ago

You can also try setting useBarrierExecutionMode=True; I think it might give a better error message.
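A minimal sketch of what that might look like, assuming a build that exposes the useBarrierExecutionMode param:

```python
# Hedged sketch: barrier execution mode schedules all training tasks in
# the stage together, so a failing worker tends to surface a clearer
# stage-level error instead of a downstream "connection refused".
from mmlspark.lightgbm import LightGBMClassifier

lgb = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    useBarrierExecutionMode=True,
)
```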

imatiach-msft commented 3 years ago

I would only use voting_parallel if you have a high number of features; see the guide: https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html


icankeep commented 3 years ago

Same problem. Everything works fine when I reduce the amount of training data.

Simon-LLong commented 2 years ago

Same problem. Voting parallel works fine, but accuracy is very low; much of the data is skipped.

imatiach-msft commented 2 years ago

@Simon-LLong sorry about the problems you are encountering. Indeed Voting Parallel can give lower accuracy, but with much better speedup and lower memory usage.

Can you also please try the new mode:

- useSingleDatasetMode=True
- numThreads = num cores - 1

These two PRs should resolve this:

#1222

#1282

In performance testing we saw a big speedup with the new single dataset mode and numThreads set to num cores - 1, as well as lower memory usage. The two PRs above will be available in 0.9.5, or you can get them with the latest build right now. In 0.9.5 these params will be set by default, but in earlier versions, like the currently released 0.9.4, you can set them directly.
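A minimal sketch of setting these directly on 0.9.4, assuming the synapse.ml.lightgbm Python namespace of SynapseML 0.9.x and an illustrative 8-core executor:

```python
# Hedged sketch for SynapseML 0.9.4, where useSingleDatasetMode and
# numThreads must be set explicitly (0.9.5 makes them the defaults).
from synapse.ml.lightgbm import LightGBMClassifier

executor_cores = 8  # assumption: cores available to each executor

lgb = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    useSingleDatasetMode=True,       # one shared LightGBM Dataset per executor
    numThreads=executor_cores - 1,   # leave one core for Spark/networking
)
```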

For more information on the new single dataset mode please see the PR description:

1066

This new mode was created after extensive internal benchmarking.

I have some ideas on how a streaming mode could also be added to distributed LightGBM, where data is streamed directly into the native histogram-binned representation, which takes only a small fraction of the memory the full Spark dataset occupies when everything is loaded at once. Setup might be a little slower, but it should vastly reduce memory usage. This is something I will be looking into in the near future.

nitinmnsn commented 2 years ago

> numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.

Is this the number of cores on my executor node, the number of cores in my executor, or the number of cores on my cluster?
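Following imatiach-msft's "num cores - 1" suggestion above, one plausible reading is cores per executor; a hedged sketch of deriving that value from the Spark config, assuming spark.executor.cores is set and that per-executor cores is the intended meaning:

```python
# Hedged sketch: derive numThreads from the per-executor core count.
# Assumption: "real CPU cores" here means cores per executor, since in
# single dataset mode each executor hosts one LightGBM process.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
executor_cores = int(spark.conf.get("spark.executor.cores", "1"))
num_threads = max(1, executor_cores - 1)  # leave one core for Spark itself
```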