mcb0035 opened this issue 5 years ago
@mcb0035 Did you disable dynamic allocation? We don't support dynamic allocation yet. For the classification case, might you have unbalanced data? If so, could you try out this build, where I throw an error on unbalanced classes, which comes from this PR: https://github.com/Azure/mmlspark/pull/618. Use --packages com.microsoft.ml.spark:mmlspark_2.11:0.17+83-11237da2 and --repositories https://mmlspark.azureedge.net/maven
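For reference, a minimal sketch of what attaching that build could look like when constructing your own SparkSession (the app name here is assumed; on Databricks the same coordinates are normally installed as a Maven library on the cluster instead):

```python
from pyspark.sql import SparkSession

# Attach the suggested dev build via Maven coordinates (values from above);
# on Databricks this is usually done by installing a Maven library on the cluster.
spark = (
    SparkSession.builder
    .appName("mmlspark-lightgbm-debug")  # assumed app name
    .config("spark.jars.packages",
            "com.microsoft.ml.spark:mmlspark_2.11:0.17+83-11237da2")
    .config("spark.jars.repositories",
            "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)
```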
For LightGBMRegressor, I'm not sure why you are seeing issues. What parameters are you running with on Databricks?
Autoscaling is turned off for this cluster. Also, I notice your cluster uses the ML Databricks runtime. We plan to use Azure Machine Learning to drive this workflow, and the AML documentation says to use only the non-ML Databricks runtime. I tried the new build (mmlspark_2.11:0.17+83-11237da2), but it fails at the line "from mmlspark import LightGBMClassifier, LightGBMRegressor" with the error "cannot import name 'LightGBMClassifier'". Did the names change in this build?
With the old build (mmlspark_2.11:0.17.dev27) LightGBMRegressor was able to fit() once after about 30 minutes, but then I tried the same command with the same dataset a second time and it hung again.
Hi @mcb0035, sorry about the trouble you are having. The imports have changed after the refactor by @mhamilton723; the new imports are:
from mmlspark.train import LightGBMClassifier, LightGBMRegressor
That did not work either:
This is for mmlspark_2.11:0.17+83-11237da2.
Sorry, it's:
from mmlspark.lightgbm import LightGBMClassifier
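For anyone else landing here, the combined import on that build should be as follows (LightGBMRegressor lives in the same mmlspark.lightgbm module, as far as I can tell):

```python
# Corrected imports for the 0.17+83-11237da2 build after the refactor
from mmlspark.lightgbm import LightGBMClassifier, LightGBMRegressor
```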
@mcb0035 Did you run into any other issues? Would you be able to close this issue? It sounds like we were able to resolve it when debugging by not using the new barrier execution mode (which doesn't seem to be working).
Yes, useBarrierExecutionMode=False and isUnbalance=True solved the problem. Thanks so much for your help.
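For future readers, a minimal sketch of the configuration that worked here (column names and the training DataFrame are placeholders):

```python
from mmlspark.lightgbm import LightGBMClassifier

classifier = LightGBMClassifier(
    labelCol="label",                # placeholder label column
    featuresCol="features",          # placeholder assembled feature vector column
    isUnbalance=True,                # account for the unbalanced classes
    useBarrierExecutionMode=False,   # barrier execution mode was causing the hangs
)
model = classifier.fit(train_df)     # train_df: placeholder Spark DataFrame
```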
@mcb0035 can this issue be closed or are there any other problems you are still encountering?
With useBarrierExecutionMode=true, LightGBMClassifier() will randomly hang without an error (tested with commit 671b68892ace5967e60c7a064effd42dd5a21ec7).
I will also post an issue with more details for this, but I'm having the same issue with the SynapseML Vowpal Wabbit VowpalWabbitContextualBandit() model. Setting useBarrierExecutionMode=False and turning auto-scaling off didn't resolve it. This describes the behavior perfectly:

"Usually there is no error, it just hangs on fit() for hours or days until I kill the job, with nothing output to stdout or stderr. A few times the classifier successfully trained a model, which took about 1 second per iteration for the same inputs. The inconsistency, lack of error messages, and occasional success for the same inputs make this difficult to debug. I will update if I get a more informative error message."
Here is the basic cluster config (with autoscaling turned back on, since turning it off didn't help).
LightGBMClassifier and LightGBMRegressor both hang indefinitely most of the time on Databricks. My 6 node cluster has Databricks runtime 5.5, Scala 2.11, Spark 2.4.3, and mmlspark_2.11:0.17.dev27. I started using the development version of mmlspark after seeing comments claiming several issues in 0.17 were fixed there, but the problem persists. Another comment suggested adding useBarrierExecutionMode=True, so I did that as well, but that also did not fix the problem.
Usually there is no error, it just hangs on fit() for hours or days until I kill the job, with nothing output to stdout or stderr. Occasionally it fails with a "connection refused" error, but as stated elsewhere, this is a generic error that does not point to the root cause.
A few times the classifier successfully trained a model, which took about 1 second per iteration for the same inputs. The inconsistency, lack of error messages, and occasional success for the same inputs make this difficult to debug. I will update if I get a more informative error message.
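For context, a rough sketch of the call that hangs on the 0.17.dev27 build (the DataFrame and column names are placeholders):

```python
from mmlspark import LightGBMClassifier  # 0.17.dev27-era import path

classifier = LightGBMClassifier(
    labelCol="label",                 # placeholder label column
    featuresCol="features",           # placeholder feature vector column
    useBarrierExecutionMode=True,     # suggested in another issue; did not help
)
model = classifier.fit(train_df)      # placeholder DataFrame; usually hangs here
```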