bottergpt opened this issue 11 months ago
Hey @zhangqibot :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.
@svotaw Hi Scott, could you please help me with this? I've looked into related issues, but setting useBarrierExecutionMode=True has not resolved the problem. My training set contains roughly 300 million rows and 300 features.
Hi @svotaw any update on this?
I'm also getting these types of errors when training LightGBM with large volumes of data (say, 6M+ rows). I've tried different SynapseML versions and configurations (useSingleDatasetMode, useBarrierExecutionMode, streaming mode, ...) but have never found a stable configuration that works at those volumes.
org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(20, 0) finished unsuccessfully.
)"useBarrierExecutionMode": True, "useSingleDatasetMode": True, "executionMode": "streaming", "microBatchSize": 50
) and same dataset that worked in v0.11.1 it fails in v0.11.2 (ofc changing executionMode
to dataTransferMode
) returning this barrier stage error. More cluster resources don't seem to solve the problem.
Are you still working on this? I see many related issues but no solution: https://github.com/microsoft/SynapseML/issues/1836 https://github.com/microsoft/SynapseML/issues/1840 https://github.com/microsoft/SynapseML/issues/1967 https://github.com/microsoft/SynapseML/issues/2014
Thanks in advance ;)
@zhangqibot can you please use a more recent version, 0.11.2 at least? mmlspark is a REALLY old version that we don't work on anymore.
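For anyone landing here, a minimal sketch of pulling a recent SynapseML build into a PySpark session (coordinates per the SynapseML README; on Databricks you would typically attach the library to the cluster instead, and the version is an example to adjust):

from pyspark.sql import SparkSession

# Attach a recent SynapseML build instead of the legacy mmlspark package.
spark = (
    SparkSession.builder
    .appName("synapseml-lightgbm")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.11.2")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)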
@gonzaest3 can you give more details? #partitions, #nodes, exact stack traces.
please try without barrier mode (useBarrierExecutionMode = false). This is not a recommended setting, and I'd rather debug your issue without it. useSingleDatasetMode = true does nothing for dataTransferMode = streaming, since streaming ONLY works with single Datasets and ignores this setting.
So use:
dataTransferMode = streaming
useBarrierExecutionMode = false
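In code form, a minimal sketch of those settings (assuming SynapseML >= 0.11.2, where dataTransferMode replaces the old executionMode parameter; column names are placeholders):

from synapse.ml.lightgbm import LightGBMClassifier

classifier = LightGBMClassifier(
    featuresCol="features",         # placeholder: an assembled feature vector column
    labelCol="label",               # placeholder label column
    dataTransferMode="streaming",   # streaming always uses a single Dataset per executor
    useBarrierExecutionMode=False,  # barrier mode is not recommended; debug without it
)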
Also, please read this, particularly the section on dynamic allocation (the common source of networking problems): https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/LightGBM/Overview/#architecture
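As a hedged sanity check (not from the linked docs, just a convenient guard): verify dynamic allocation is actually off before calling fit, since executors joining or leaving mid-training breaks LightGBM's network setup.

# spark is assumed to be the active SparkSession.
if spark.conf.get("spark.dynamicAllocation.enabled", "false") == "true":
    raise RuntimeError(
        "Disable dynamic allocation (spark.dynamicAllocation.enabled=false) "
        "before training distributed LightGBM."
    )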
Sure @svotaw! Thanks for the response and sorry for the late reply (we've been focusing on other parts). The problem for us seems to appear in versions > 0.11.1.
Here are the params passed to LightGBM and also the repartitioning used (we tried different numbers as well, but this one worked with 0.11.1):
from synapse.ml.lightgbm import LightGBMClassifier

MODEL_PARAMS = {
"featuresCol": FEATURE_COLUMN,
"labelCol": TARGET_COLUMN,
"deterministic": True,
"baggingSeed": SEED,
"dataRandomSeed": SEED,
"dropSeed": SEED,
"extraSeed": SEED,
"objectiveSeed": SEED,
"seed": SEED,
"maxDepth": -1,
"slotNames": INPUT_COLUMN_NAMES,
"categoricalSlotNames": CATEGORICAL_COLUMNS,
"maxCatThreshold": 16,
"maxCatToOnehot": 7,
"useMissing": True,
"useBarrierExecutionMode": False,
"executionMode": "streaming",
# "microBatchSize": 50,
# "numBatches": 16,
"verbosity": 2,
}
repartitioned_train_data = train_data.repartition(200).cache()
model = LightGBMClassifier(**MODEL_PARAMS)
model = model.fit(repartitioned_train_data)
We already ensured the cluster doesn't have any autoscaling policy or dynamic allocation, and we also tried different values for numBatches, microBatchSize, numTasks... all without success.
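For readers comparing notes: one detail that often matters here is having every training task running in a single wave, since LightGBM's network handshake expects all participating tasks to be up at once. A hedged sketch of sizing partitions to total executor cores (an assumption to test against the 200-partition setup above, not a guaranteed fix):

# spark is assumed to be the active SparkSession; train_data as above.
num_partitions = spark.sparkContext.defaultParallelism  # e.g. 3 workers x 8 cores = 24
repartitioned_train_data = train_data.repartition(num_partitions).cache()
repartitioned_train_data.count()  # materialize the cache before calling fit()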
The cluster used in Databricks has the following resources (we also tried larger clusters without success):
Driver: Standard_DS13_v2 · Workers: Standard_DS13_v2 · 3-3 workers · 13.3 LTS ML (includes Apache Spark 3.4.1, Scala 2.12)
Here is the stack trace we got with 0.11.2 (same error for other versions like 1.0.1 or 1.0.2) using the same parameters (just changing executionMode to dataTransferMode):
Py4JJavaError Traceback (most recent call last)
File <command-2330636204477965>, line 10
7 print(f"MODEL_PARAMS: {MODEL_PARAMS}")
9 model = LightGBMClassifier(**MODEL_PARAMS)
---> 10 model = model.fit(repartitioned_train_data)
File /databricks/python/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py:552, in safe_patch.<locals>.safe_patch_function(*args, **kwargs)
550 patch_function.call(call_original, *args, **kwargs)
551 else:
--> 552 patch_function(call_original, *args, **kwargs)
554 session.state = "succeeded"
556 try_log_autologging_event(
557 AutologgingEventLogger.get_logger().log_patch_function_success,
558 session,
(...)
562 kwargs,
563 )
File /databricks/python/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py:252, in with_managed_run.<locals>.patch_with_managed_run(original, *args, **kwargs)
249 managed_run = create_managed_run()
251 try:
--> 252 result = patch_function(original, *args, **kwargs)
253 except (Exception, KeyboardInterrupt):
254 # In addition to standard Python exceptions, handle keyboard interrupts to ensure
255 # that runs are terminated if a user prematurely interrupts training execution
256 # (e.g. via sigint / ctrl-c)
257 if managed_run:
File /databricks/python/lib/python3.10/site-packages/mlflow/pyspark/ml/__init__.py:1137, in autolog.<locals>.patched_fit(original, self, *args, **kwargs)
1135 if t.should_log():
1136 with _AUTOLOGGING_METRICS_MANAGER.disable_log_post_training_metrics():
-> 1137 fit_result = fit_mlflow(original, self, *args, **kwargs)
1138 # In some cases the `fit_result` may be an iterator of spark models.
1139 if should_log_post_training_metrics and isinstance(fit_result, Model):
File /databricks/python/lib/python3.10/site-packages/mlflow/pyspark/ml/__init__.py:1123, in autolog.<locals>.fit_mlflow(original, self, *args, **kwargs)
1121 input_training_df = args[0].persist(StorageLevel.MEMORY_AND_DISK)
1122 _log_pretraining_metadata(estimator, params, input_training_df)
-> 1123 spark_model = original(self, *args, **kwargs)
1124 _log_posttraining_metadata(estimator, spark_model, params, input_training_df)
1125 input_training_df.unpersist()
File /databricks/python/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py:533, in safe_patch.<locals>.safe_patch_function.<locals>.call_original(*og_args, **og_kwargs)
530 original_result = original(*_og_args, **_og_kwargs)
531 return original_result
--> 533 return call_original_fn_with_event_logging(_original_fn, og_args, og_kwargs)
File /databricks/python/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py:468, in safe_patch.<locals>.safe_patch_function.<locals>.call_original_fn_with_event_logging(original_fn, og_args, og_kwargs)
459 try:
460 try_log_autologging_event(
461 AutologgingEventLogger.get_logger().log_original_function_start,
462 session,
(...)
466 og_kwargs,
467 )
--> 468 original_fn_result = original_fn(*og_args, **og_kwargs)
470 try_log_autologging_event(
471 AutologgingEventLogger.get_logger().log_original_function_success,
472 session,
(...)
476 og_kwargs,
477 )
478 return original_fn_result
File /databricks/python/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py:530, in safe_patch.<locals>.safe_patch_function.<locals>.call_original.<locals>._original_fn(*_og_args, **_og_kwargs)
522 # Show all non-MLflow warnings as normal (i.e. not as event logs)
523 # during original function execution, even if silent mode is enabled
524 # (`silent=True`), since these warnings originate from the ML framework
525 # or one of its dependencies and are likely relevant to the caller
526 with set_non_mlflow_warnings_behavior_for_current_thread(
527 disable_warnings=False,
528 reroute_warnings=False,
529 ):
--> 530 original_result = original(*_og_args, **_og_kwargs)
531 return original_result
File /databricks/spark/python/pyspark/ml/base.py:205, in Estimator.fit(self, dataset, params)
203 return self.copy(params)._fit(dataset)
204 else:
--> 205 return self._fit(dataset)
206 else:
207 raise TypeError(
208 "Params must be either a param map or a list/tuple of param maps, "
209 "but got %s." % type(params)
210 )
File /local_disk0/spark-e8bc8fb5-f2a2-4835-9c67-ac322c3bff4e/userFiles-b8797d39-04b9-4c27-acd2-defba05528b2/com_microsoft_azure_synapseml_lightgbm_2_12_0_11_2.jar/synapse/ml/lightgbm/LightGBMClassifier.py:2148, in LightGBMClassifier._fit(self, dataset)
2147 def _fit(self, dataset):
-> 2148 java_model = self._fit_java(dataset)
2149 return self._create_model(java_model)
File /databricks/spark/python/pyspark/ml/wrapper.py:378, in JavaEstimator._fit_java(self, dataset)
375 assert self._java_obj is not None
377 self._transfer_params_to_java()
--> 378 return self._java_obj.fit(dataset._jdf)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
1316 command = proto.CALL_COMMAND_NAME +\
1317 self.command_header +\
1318 args_command +\
1319 proto.END_COMMAND_PART
1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
1323 answer, self.gateway_client, self.target_id, self.name)
1325 for temp_arg in temp_args:
1326 if hasattr(temp_arg, "_detach"):
File /databricks/spark/python/pyspark/errors/exceptions/captured.py:188, in capture_sql_exception.<locals>.deco(*a, **kw)
186 def deco(*a: Any, **kw: Any) -> Any:
187 try:
--> 188 return f(*a, **kw)
189 except Py4JJavaError as e:
190 converted = convert_exception(e.java_exception)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
Py4JJavaError: An error occurred while calling o875.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 207.0 failed 4 times, most recent failure: Lost task 1.3 in stage 207.0 (TID 1942) (172.28.23.136 executor 1): java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:613)
at java.net.Socket.connect(Socket.java:561)
at java.net.Socket.<init>(Socket.java:457)
at java.net.Socket.<init>(Socket.java:234)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:133)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:120)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:115)
at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:111)
at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:197)
at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:132)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615)
at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:224)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:931)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:931)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:407)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:404)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:371)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:407)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:404)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:371)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:196)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:181)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:146)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:41)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:99)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:104)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:103)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:146)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$8(Executor.scala:897)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1709)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:900)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:795)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3588)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3519)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3506)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3506)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1516)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1516)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1516)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3835)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3747)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3735)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$runJob$1(DAGScheduler.scala:1240)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1228)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2959)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$runSparkJobs$1(Collector.scala:286)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:282)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$collect$1(Collector.scala:366)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:363)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:117)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:124)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:126)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:114)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:94)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$computeResult$1(ResultCacheManager.scala:557)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.collectResult$1(ResultCacheManager.scala:545)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.computeResult(ResultCacheManager.scala:565)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:426)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:419)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:313)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeCollectResult$1(SparkPlan.scala:519)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:516)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3628)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4553)
at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3595)
at org.apache.spark.sql.Dataset.$anonfun$withAction$3(Dataset.scala:4544)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:945)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4542)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:274)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:498)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:201)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1113)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:151)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:447)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4542)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:3595)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks(LightGBMBase.scala:623)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining(LightGBMBase.scala:598)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch(LightGBMBase.scala:446)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$train$2(LightGBMBase.scala:62)
at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb(SynapseMLLogging.scala:93)
at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb$(SynapseMLLogging.scala:90)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:27)
at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logTrain(SynapseMLLogging.scala:84)
at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logTrain$(SynapseMLLogging.scala:83)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logTrain(LightGBMClassifier.scala:27)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:64)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:36)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:613)
at java.net.Socket.connect(Socket.java:561)
at java.net.Socket.<init>(Socket.java:457)
at java.net.Socket.<init>(Socket.java:234)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:133)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:120)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:115)
at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:111)
at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:197)
at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:132)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615)
at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:224)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:931)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:931)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:407)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:404)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:371)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:407)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:404)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:371)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:196)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:181)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:146)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:41)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:99)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:104)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:103)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:146)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$8(Executor.scala:897)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1709)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:900)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:795)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
SynapseML version
com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc3
System information
I am using mmlspark with:
Describe the problem
I'm encountering a connection error while using LightGBM from mmlspark. My training data is rather large, so I tried a smaller subset, which worked without any issues. Yet even after boosting the executor memory, all trials on the full data failed with the same error. I'd appreciate any suggestions to help me fix this issue. Thanks in advance!
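A minimal hedged sketch of the downsampling test described above (train_data and the column names are hypothetical placeholders):

from synapse.ml.lightgbm import LightGBMClassifier

# Confirm the pipeline works on a small sample before scaling up to the full data.
small_sample = train_data.sample(fraction=0.01, seed=42).cache()
model = LightGBMClassifier(featuresCol="features", labelCol="label").fit(small_sample)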
Code to reproduce issue
Other info / logs