microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License
5.07k stars 832 forks source link

[BUG] LightGBM consistently crashes when trained with categorical features #2051

Open vishalovercome opened 1 year ago

vishalovercome commented 1 year ago

SynapseML version

0.11.2

System information

Describe the problem

LightGBM consistently crashes due to the same reason when its trained with categorical features. Even with < 10 categorical features the code consistently crashes while it works fine with hundreds of continuous features. What is going on? What configuration am I missing?

Code to reproduce issue

spark = (
    SparkSession.builder.appName("uninstall-prediction")
    .config(
        "spark.jars.packages",
        "org.mlflow:mlflow-spark:2.4.1,com.microsoft.azure:synapseml_2.12:0.11.2-spark3.3",
    )
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .config("spark.sql.shuffle.partitions", 20)
    .config("spark.dynamicAllocation.executorIdleTimeout", 600)
    .config("spark.default.parallelism", 100)
    .config("spark.sql.adaptive.enabled", True)
    .config("spark.sql.adaptive.coalescePartitions.enabled", True)
    .config("spark.sql.execution.arrow.pyspark.enabled", True)
    .config("spark.sql.execution.arrow.pyspark.fallback.enabled", True)
    .enableHiveSupport()
    .getOrCreate()
)

sc = spark.sparkContext
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.broadcastTimeout", "1800")

classifier = (
    LightGBMClassifier()
    .setDefaultListenPort(12402)
    .setUseBarrierExecutionMode(False)
    .setObjective("binary")
    .setIsUnbalance(True)
    .setLeafPredictionCol("leafPrediction")
    .setFeaturesShapCol("featuresShap")
    .setDataTransferMode("streaming")
    .setMicroBatchSize(1000)
    .setUseSingleDatasetMode(True)  # Automatically true in streaming mode
    .setUseMissing(True)
)

I can't share rest of the code as its proprietary.

{
  "spark.dynamicAllocation.enabled": "false"
}

### Other info / logs

```scala-traceback
DS Logging - 2023-08-18T22:57:03.243468 - An error occurred while calling o2672.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1702.0 failed 4 times, most recent failure: Lost task 3.5 in stage 1702.0 (TID 6202) (ip-172-31-253-236.ec2.internal executor 9): java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.connect0(Native Method)
    at java.base/sun.nio.ch.Net.connect(Net.java:579)
    at java.base/sun.nio.ch.Net.connect(Net.java:568)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:593)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/java.net.Socket.connect(Socket.java:583)
    at java.base/java.net.Socket.<init>(Socket.java:507)
    at java.base/java.net.Socket.<init>(Socket.java:287)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:133)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:120)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:115)
    at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:111)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:197)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:132)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615)
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:197)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2974)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2910)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2909)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2909)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1263)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1263)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1263)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3173)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3112)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3101)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1028)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2285)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2304)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2329)
    at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1019)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:1018)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:465)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4244)
    at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3462)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4234)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:570)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4232)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4232)
    at org.apache.spark.sql.Dataset.collect(Dataset.scala:3462)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks(LightGBMBase.scala:623)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining(LightGBMBase.scala:598)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch(LightGBMBase.scala:446)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$train$2(LightGBMBase.scala:62)
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb(SynapseMLLogging.scala:93)
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb$(SynapseMLLogging.scala:90)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:27)
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logTrain(SynapseMLLogging.scala:84)
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logTrain$(SynapseMLLogging.scala:83)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logTrain(LightGBMClassifier.scala:27)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:64)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:36)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.connect0(Native Method)
    at java.base/sun.nio.ch.Net.connect(Net.java:579)
    at java.base/sun.nio.ch.Net.connect(Net.java:568)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:593)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/java.net.Socket.connect(Socket.java:583)
    at java.base/java.net.Socket.<init>(Socket.java:507)
    at java.base/java.net.Socket.<init>(Socket.java:287)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:133)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:120)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:115)
    at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:111)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:197)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:132)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615)
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:197)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    ... 1 more

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

github-actions[bot] commented 1 year ago

Hey @vishalovercome :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

vishalovercome commented 1 year ago

Detailed error log:

23/08/23 19:48:15 WARN TaskSetManager: Lost task 122.1 in stage 1684.0 (TID 98538) (ip-172-31-246-80.ec2.internal executor 1): java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.connect0(Native Method)
    at java.base/sun.nio.ch.Net.connect(Net.java:579)
    at java.base/sun.nio.ch.Net.connect(Net.java:568)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:593)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/java.net.Socket.connect(Socket.java:583)
    at java.base/java.net.Socket.<init>(Socket.java:507)
    at java.base/java.net.Socket.<init>(Socket.java:287)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:133)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:120)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:115)
    at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:111)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:197)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:132)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615)
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:197)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

23/08/23 19:48:17 ERROR TaskSetManager: Task 122 in stage 1684.0 failed 4 times; aborting job
    at java.base/sun.nio.ch.Net.connect0(Native Method)
    at java.base/sun.nio.ch.Net.connect(Net.java:579)
    at java.base/sun.nio.ch.Net.connect(Net.java:568)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:593)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/java.net.Socket.connect(Socket.java:583)
    at java.base/java.net.Socket.<init>(Socket.java:507)
    at java.base/java.net.Socket.<init>(Socket.java:287)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:133)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:120)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:115)
    at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:111)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:197)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:132)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615)
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:197)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

Driver stacktrace:
23/08/23 19:48:17 ERROR LightGBMClassifier: {"buildVersion":"0.11.2","className":"class com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier","columns":null,"method":"train","uid":"LightGBMClassifier_01a751e562b2"}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 122 in stage 1684.0 failed 4 times, most recent failure: Lost task 122.3 in stage 1684.0 (TID 98540) (ip-172-31-248-101.ec2.internal executor 14): java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.connect0(Native Method)
    at java.base/sun.nio.ch.Net.connect(Net.java:579)
    at java.base/sun.nio.ch.Net.connect(Net.java:568)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:593)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/java.net.Socket.connect(Socket.java:583)
    at java.base/java.net.Socket.<init>(Socket.java:507)
    at java.base/java.net.Socket.<init>(Socket.java:287)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:133)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:120)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:115)
    at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:111)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:197)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:132)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615)
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:197)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2974) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2910) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2909) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) ~[scala-library-2.12.15.jar:?]
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) ~[scala-library-2.12.15.jar:?]
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ~[scala-library-2.12.15.jar:?]
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2909) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1263) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1263) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.15.jar:?]
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1263) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3173) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3112) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3101) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1028) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2285) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2304) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2329) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1019) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:405) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.collect(RDD.scala:1018) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:465) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4244) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3462) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4234) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:570) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4232) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4232) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.sql.Dataset.collect(Dataset.scala:3462) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks(LightGBMBase.scala:623) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining(LightGBMBase.scala:598) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch(LightGBMBase.scala:446) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$train$2(LightGBMBase.scala:62) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb(SynapseMLLogging.scala:93) ~[com.microsoft.azure_synapseml-core_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb$(SynapseMLLogging.scala:90) ~[com.microsoft.azure_synapseml-core_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:27) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logTrain(SynapseMLLogging.scala:84) ~[com.microsoft.azure_synapseml-core_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logTrain$(SynapseMLLogging.scala:83) ~[com.microsoft.azure_synapseml-core_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logTrain(LightGBMClassifier.scala:27) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:64) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:36) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:114) ~[spark-mllib_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:568) ~[?:?]
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) ~[py4j-0.10.9.7.jar:?]
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) ~[py4j-0.10.9.7.jar:?]
    at py4j.Gateway.invoke(Gateway.java:282) ~[py4j-0.10.9.7.jar:?]
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) ~[py4j-0.10.9.7.jar:?]
    at py4j.commands.CallCommand.execute(CallCommand.java:79) ~[py4j-0.10.9.7.jar:?]
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) ~[py4j-0.10.9.7.jar:?]
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106) ~[py4j-0.10.9.7.jar:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
    at sun.nio.ch.Net.connect(Net.java:579) ~[?:?]
    at sun.nio.ch.Net.connect(Net.java:568) ~[?:?]
    at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:593) ~[?:?]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327) ~[?:?]
    at java.net.Socket.connect(Socket.java:633) ~[?:?]
    at java.net.Socket.connect(Socket.java:583) ~[?:?]
    at java.net.Socket.<init>(Socket.java:507) ~[?:?]
    at java.net.Socket.<init>(Socket.java:287) ~[?:?]
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:133) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:120) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24) ~[com.microsoft.azure_synapseml-core_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:115) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28) ~[com.microsoft.azure_synapseml-core_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:111) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:197) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:132) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.11.2.jar:0.11.2]
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:197) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    ... 1 more
23/08/23 19:48:18 WARN TransportChannelHandler: Exception in connection from /172.31.240.246:48434
java.net.SocketException: Connection reset
    at sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394) ~[?:?]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426) ~[?:?]
    at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:259) ~[netty-buffer-4.1.87.Final.jar:4.1.87.Final]
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132) ~[netty-buffer-4.1.87.Final.jar:4.1.87.Final]
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.87.Final.jar:4.1.87.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.87.Final.jar:4.1.87.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.87.Final.jar:4.1.87.Final]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
vishalovercome commented 1 year ago

Tried repartitioning to ensure that there's only one partition and I still get a network error! How is that possible?

a1164714 commented 1 year ago

i have the same error.