microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

LightGBMClassifier, LightGBMRegressor hang indefinitely without error at fit() #623

Open mcb0035 opened 5 years ago

mcb0035 commented 5 years ago

LightGBMClassifier and LightGBMRegressor both hang indefinitely most of the time on Databricks. My 6-node cluster runs Databricks Runtime 5.5 with Scala 2.11, Spark 2.4.3, and mmlspark_2.11:0.17.dev27. I started using the development version of mmlspark after seeing comments claiming that several issues in 0.17 were fixed there, but the problem persists. Another comment suggested adding useBarrierExecutionMode=True, so I did that as well, but it did not fix the problem either.

Usually there is no error; it just hangs on fit() for hours or days until I kill the job, with nothing written to stdout or stderr. Occasionally it fails with a "connection refused" error, but as stated elsewhere, this is a generic error that does not point to the root cause.

A few times the classifier successfully trained a model, which took about 1 second per iteration for the same inputs. The inconsistency, lack of error messages, and occasional success for the same inputs make this difficult to debug. I will update if I get a more informative error message.
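For context, here is a minimal sketch of the kind of call that hangs. It assumes a prepared Spark DataFrame train_df with a features vector column and a label column; the column names are hypothetical and not from the original report.

```python
# Import path used by the mmlspark 0.17.dev27-era builds (later builds moved it; see below)
from mmlspark import LightGBMClassifier

classifier = LightGBMClassifier(
    featuresCol="features",
    labelCol="label",
    useBarrierExecutionMode=True,  # suggested in another comment; did not help here
)
model = classifier.fit(train_df)  # hangs indefinitely most of the time
```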

imatiach-msft commented 5 years ago

@mcb0035 Did you disable dynamic allocation? We don't support dynamic allocation yet. For the classification case, might you have unbalanced data? If so, could you try this build, which throws an error on unbalanced classes (from this PR: https://github.com/Azure/mmlspark/pull/618): --packages com.microsoft.ml.spark:mmlspark_2.11:0.17+83-11237da2 with --repositories https://mmlspark.azureedge.net/maven

For LightGBMRegressor, I'm not sure why you are seeing issues. What parameters are you running with on Databricks?
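For reference, dynamic allocation is normally turned off at the cluster level with the standard Spark property below; this is a sketch of the property name only, not Databricks-specific guidance.

```
spark.dynamicAllocation.enabled false
```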

imatiach-msft commented 5 years ago

[screenshot of cluster configuration]

mcb0035 commented 5 years ago

[screenshot of cluster configuration]

Autoscaling is turned off for this cluster. Also, I notice your cluster uses the ML Databricks runtime. We plan to use Azure Machine Learning to drive this workflow, and the AML documentation says to use only the non-ML Databricks runtime. I tried the new build (mmlspark_2.11:0.17+83-11237da2), but it fails at the line "from mmlspark import LightGBMClassifier, LightGBMRegressor" with the error "cannot import name 'LightGBMClassifier'". Did the names change in this build?

[screenshot of the import error]

With the old build (mmlspark_2.11:0.17.dev27), LightGBMRegressor was able to fit() once after about 30 minutes, but when I ran the same command on the same dataset a second time it hung again.

imatiach-msft commented 5 years ago

Hi @mcb0035, sorry about the trouble you are having. The imports changed after the refactor by @mhamilton723; the new imports are:

from mmlspark.train import LightGBMClassifier, LightGBMRegressor

mcb0035 commented 5 years ago

That did not work either: [screenshot of the import error]

This is for mmlspark_2.11:0.17+83-11237da2.

imatiach-msft commented 5 years ago

Sorry, it's:

from mmlspark.lightgbm import LightGBMClassifier

imatiach-msft commented 5 years ago

@mcb0035 Did you run into any other issues? Would you be able to close this issue? It sounds like we were able to resolve it while debugging by not using the new barrier execution mode (which doesn't seem to be working).

mcb0035 commented 5 years ago

Yes, useBarrierExecutionMode=False and isUnbalance=True solved the problem. Thanks so much for your help.
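For anyone hitting the same behavior, here is a minimal sketch of the configuration that worked in this thread. The DataFrame and column names are hypothetical; the import path and the two parameters are the ones discussed above.

```python
from mmlspark.lightgbm import LightGBMClassifier

classifier = LightGBMClassifier(
    featuresCol="features",
    labelCol="label",
    isUnbalance=True,               # the classes were unbalanced
    useBarrierExecutionMode=False,  # barrier execution mode was causing the hang
)
model = classifier.fit(train_df)
```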

imatiach-msft commented 5 years ago

@mcb0035 can this issue be closed or are there any other problems you are still encountering?

anttisaukko commented 5 years ago

With useBarrierExecutionMode=True, LightGBMClassifier() will randomly hang without an error (tested with commit 671b68892ace5967e60c7a064effd42dd5a21ec7).

mholmeslinder commented 1 year ago

I will also post a separate issue with more details, but I'm having the same problem with the SynapseML Vowpal Wabbit VowpalWabbitContextualBandit() model. Setting useBarrierExecutionMode=False and turning autoscaling off did not resolve it.
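A minimal sketch of the kind of call involved, assuming VowpalWabbitContextualBandit exposes the useBarrierExecutionMode parameter mentioned above, with a hypothetical training DataFrame train_df:

```python
from synapse.ml.vw import VowpalWabbitContextualBandit

cb = VowpalWabbitContextualBandit(useBarrierExecutionMode=False)
model = cb.fit(train_df)  # still hangs intermittently even with barrier mode off and autoscaling disabled
```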

The following from the original post describes the behavior perfectly:

"Usually there is no error; it just hangs on fit() for hours or days until I kill the job, with nothing written to stdout or stderr."

"A few times the classifier successfully trained a model, which took about 1 second per iteration for the same inputs."

"The inconsistency, lack of error messages, and occasional success for the same inputs make this difficult to debug. I will update if I get a more informative error message."

Here is the basic cluster config (with autoscaling turned back on, since turning it off didn't help).

[screenshot of cluster configuration]