microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

[LightGBM] java.util.NoSuchElementException with ranker.fit() #1973

Open jovis-gnn opened 1 year ago

jovis-gnn commented 1 year ago

SynapseML version

0.10.2

System information

Describe the problem

I'm testing LightGBM on an EMR cluster. I created a sample dataset and tried to fit it with a LightGBMRanker model. I got some errors, and it seems there is a problem collecting the dataset. Please give me some feedback if you have any ideas...

Thank you.

Code to reproduce issue

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = (
    SparkSession.builder.appName("jovis")
    .config("spark.sql.session.timeZone", "UTC")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.2")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .enableHiveSupport()
    .getOrCreate()
)

# Toy ranking dataset: three features, a query-group id, a relevance label,
# and a validation-set indicator (all False here)
train_df = spark.createDataFrame(
    [
        [1.0, 1.0, 1.0, 0, 0.0, False],
        [1.0, 2.0, 2.0, 0, 1.0, False],
        [1.0, 6.0, 13.0, 0, 0.0, False],
        [1.0, 3.0, 14.0, 0, 1.0, False],
        [1.0, 3.0, 12.0, 0, 0.0, False],
        [1.0, 8.0, 6.0, 0, 1.0, False],
        [1.0, 4.0, 4.0, 1, 1.0, False],
        [1.0, 3.0, 8.0, 1, 0.0, False],
        [1.0, 4.0, 3.0, 1, 0.0, False],
        [1.0, 7.0, 2.0, 1, 0.0, False],
        [1.0, 2.0, 1.0, 1, 0.0, False]
    ], 
    ['feat_1', 'feat_2', 'feat_3', 'group_id', 'label', 'validation']
)
# Assemble the three feature columns into a single vector column
featurizer = VectorAssembler(inputCols=['feat_1', 'feat_2', 'feat_3'], outputCol="features")
train_df = featurizer.transform(train_df)

from synapse.ml.lightgbm import LightGBMRanker

# LambdaRank ranker with MAP evaluation and early stopping on the validation rows
ranker = LightGBMRanker(
    labelCol="label",
    featuresCol="features",
    groupCol="group_id",
    validationIndicatorCol="validation",
    objective="lambdarank",
    numLeaves=31,
    numIterations=200,
    metric="map",
    boostingType="gbdt",
    evalAt=[1, 5, 10],
    earlyStoppingRound=10
)

ranker.fit(train_df)

Other info / logs

23/06/02 06:45:41 WARN TaskSetManager: Lost task 0.0 in stage 30.0 (TID 183) (ip-172-31-143-99.ap-northeast-2.compute.internal executor 3): java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:529)
    at scala.None$.get(Option.scala:527)
    at com.microsoft.azure.synapse.ml.lightgbm.dataset.PeekingIterator.peek(DatasetAggregator.scala:113)
    at com.microsoft.azure.synapse.ml.lightgbm.dataset.BaseChunkedColumns.<init>(DatasetAggregator.scala:130)
    at com.microsoft.azure.synapse.ml.lightgbm.dataset.DenseChunkedColumns.<init>(DatasetAggregator.scala:217)
    at com.microsoft.azure.synapse.ml.lightgbm.BulkPartitionTask.getChunkedColumns(BulkPartitionTask.scala:76)
    at com.microsoft.azure.synapse.ml.lightgbm.BulkPartitionTask.$anonfun$preparePartitionDataInternal$1(BulkPartitionTask.scala:47)
    at scala.Option.map(Option.scala:230)
    at com.microsoft.azure.synapse.ml.lightgbm.BulkPartitionTask.preparePartitionDataInternal(BulkPartitionTask.scala:46)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.preparePartitionData(BasePartitionTask.scala:210)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:121)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:201)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:138)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

23/06/02 06:45:42 WARN TaskSetManager: Lost task 0.1 in stage 30.0 (TID 184) (ip-172-31-143-99.ap-northeast-2.compute.internal executor 3): java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:607)
    at java.net.Socket.connect(Socket.java:556)
    at java.net.Socket.<init>(Socket.java:452)
    at java.net.Socket.<init>(Socket.java:229)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:129)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:116)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:111)
    at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:107)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:179)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:114)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:201)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:138)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

23/06/02 06:45:44 ERROR TaskSetManager: Task 0 in stage 30.0 failed 4 times; aborting job
23/06/02 06:45:44 ERROR LightGBMRanker: {"buildVersion":"0.10.2","className":"class com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker","method":"train","uid":"LightGBMRanker_dfa33d4f60b4"}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 4 times, most recent failure: Lost task 0.3 in stage 30.0 (TID 186) (ip-172-31-143-99.ap-northeast-2.compute.internal executor 3): java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:607)
    at java.net.Socket.connect(Socket.java:556)
    at java.net.Socket.<init>(Socket.java:452)
    at java.net.Socket.<init>(Socket.java:229)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:129)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:116)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:111)
    at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:107)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:179)
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:114)
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:201)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:138)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2798) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) ~[scala-library-2.12.15.jar:?]
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) ~[scala-library-2.12.15.jar:?]
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ~[scala-library-2.12.15.jar:?]
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2798) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1239) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1239) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.15.jar:?]
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1239) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3051) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2993) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2982) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1009) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2229) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2250) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2269) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2294) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.collect(RDD.scala:1020) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:441) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:483) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:522) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:483) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3932) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3161) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3922) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:554) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3920) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:138) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3920) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.sql.Dataset.collect(Dataset.scala:3161) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks(LightGBMBase.scala:597) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks$(LightGBMBase.scala:583) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.executePartitionTasks(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining(LightGBMBase.scala:573) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining$(LightGBMBase.scala:545) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.executeTraining(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch(LightGBMBase.scala:435) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch$(LightGBMBase.scala:392) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.trainOneDataBatch(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$train$2(LightGBMBase.scala:61) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb(BasicLogging.scala:62) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb$(BasicLogging.scala:59) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.logVerb(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain(BasicLogging.scala:48) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain$(BasicLogging.scala:47) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.logTrain(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:42) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:35) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.train(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.train(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:151) ~[spark-mllib_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_372]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_372]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_372]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_372]
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) ~[py4j-0.10.9.5.jar:?]
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) ~[py4j-0.10.9.5.jar:?]
    at py4j.Gateway.invoke(Gateway.java:282) ~[py4j-0.10.9.5.jar:?]
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) ~[py4j-0.10.9.5.jar:?]
    at py4j.commands.CallCommand.execute(CallCommand.java:79) ~[py4j-0.10.9.5.jar:?]
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) ~[py4j-0.10.9.5.jar:?]
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106) ~[py4j-0.10.9.5.jar:?]
    at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_372]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_372]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_372]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_372]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_372]
    at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_372]
    at java.net.Socket.connect(Socket.java:556) ~[?:1.8.0_372]
    at java.net.Socket.<init>(Socket.java:452) ~[?:1.8.0_372]
    at java.net.Socket.<init>(Socket.java:229) ~[?:1.8.0_372]
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:129) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:116) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:111) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:107) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:179) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:114) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:201) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_372]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_372]
    ... 1 more

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

github-actions[bot] commented 1 year ago

Hey @jovis-gnn :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

saileshbaidya commented 1 year ago

Thanks, jovis-gnn, for reporting this.

The NoSuchElementException seems to be the result of not being able to compute an RDD partition or read it from a checkpoint, because the attempt to connect to the Spark driver is failing with java.net.ConnectException. Can you please check along this line on the EMR cluster? In the meantime, we will investigate this further on our end, because I also see a possibility of addressing the second half of this scenario on our side.

@svotaw can you please also take a look? It looks like we need to handle NoSuchElementException in the DataAggregator.

jovis-gnn commented 1 year ago

@saileshbaidya Thanks for the reply. I found the reason for this exception. There were no "True" values in the validation column of my sample data, yet I specified validationIndicatorCol, so the LightGBM module threw NoSuchElementException. After changing some of those values to "True", it was resolved.
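For anyone hitting the same error, a minimal sketch of both workarounds (the per-group split rule below is illustrative, not required):

from pyspark.sql import functions as F

# Workaround 1: actually populate the validation set. For ranking, mark whole
# query groups as validation so groups are not split (group 1 here, as an example).
train_df = train_df.withColumn("validation", F.col("group_id") == 1)

# Workaround 2: if no validation set is intended, omit validationIndicatorCol
# (and earlyStoppingRound, which needs validation data) from the ranker parameters.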

By the way, after solving that problem, the data collecting stage was pending for a long time (it never ended). After repartitioning the training set to 1 partition, it worked. Could you help me understand the reason (is there a minimum/maximum number of partitions for training, or something similar)?

svotaw commented 1 year ago

The DataAggregator is being deprecated, so we won't change that code path. The newer "streaming" mode is available in our latest releases. Please ask for a copy if you want to try it (there is no official release with the latest fixes yet).

The LightGBM algorithm does not work with auto-scaled clusters, so please turn off any scaling. It also helps to set "spark.dynamicAllocation.enabled": "false".
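As a sketch, that setting can go into the same SparkSession builder used in the repro (the rest of the configuration is unchanged):

spark = (
    SparkSession.builder.appName("jovis")
    # Pin the executor set: LightGBM workers must all be up when the
    # network topology is negotiated, so disable dynamic allocation.
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.sql.session.timeZone", "UTC")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.2")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .enableHiveSupport()
    .getOrCreate()
)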

You are likely hitting scaling problems that affect networking (and look like hangs). By repartitioning to 1, you remove the networking entirely (only one node is used). With the version you have, you can try small partition counts to reduce the hangs.
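A minimal sketch of that workaround, reusing the ranker and training DataFrame from the repro (the partition count is something to tune, not a fixed rule):

# Fewer partitions means fewer LightGBM workers that must connect to one
# another; a single partition removes inter-worker networking entirely.
num_partitions = 1  # start at 1, then raise it while training stays stable
ranker.fit(train_df.repartition(num_partitions))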

svotaw commented 1 year ago

We have released 0.11.2, which has the final streaming features.
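For completeness, a hedged sketch of opting into the streaming data path on a newer release; the package coordinate and the dataTransferMode parameter name are assumptions based on later SynapseML releases, so check the LightGBMRanker docs for your version:

# Assumed coordinate for the newer release:
#   .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.11.2")
ranker = LightGBMRanker(
    labelCol="label",
    featuresCol="features",
    groupCol="group_id",
    objective="lambdarank",
    dataTransferMode="streaming",  # assumed parameter name; "bulk" is the legacy path
)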