microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

[BUG]java.lang.ArrayIndexOutOfBoundsException on multi-node cluster run #2278

Open bjm88620 opened 2 months ago

bjm88620 commented 2 months ago

SynapseML version

com.microsoft.azure:synapseml_2.12:0.11.4-spark3.3

System information

Describe the problem

I run a LightGBM fit job in a for-loop for rolling back validation. On a multi-node cluster the job fails with a Connection Refused error in the log. After checking the failed tasks, I found that the executor failed with java.lang.ArrayIndexOutOfBoundsException, which in turn caused the Connection Refused error.

The same job runs on a single-node cluster without any issue.

The DataFrame passed to the model has around 48,000 records, partitioned as follows:

Partition 0 has 19000 records
Partition 1 has 18000 records
Partition 2 has 7000 records
Partition 3 has 4000 records

Calling df.repartition(5) does not fix the issue.
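For anyone trying to reproduce this, per-partition counts like the ones listed above can be collected with `spark_partition_id()` from `pyspark.sql.functions`. The following is only a minimal sketch: the helper names `partition_counts` and `summarize` are hypothetical and not part of SynapseML, and the skew ratio (largest partition divided by the mean partition size) is just one convenient way to describe imbalance, not something the library reports.

```python
def partition_counts(df):
    """Return {partition_id: row_count} for a PySpark DataFrame.

    Partitions that hold no rows do not appear in the result,
    because groupBy only returns groups that contain rows.
    """
    # Imported lazily so summarize() below also works without Spark installed.
    from pyspark.sql.functions import spark_partition_id

    rows = df.groupBy(spark_partition_id().alias("pid")).count().collect()
    return {row["pid"]: row["count"] for row in rows}


def summarize(counts):
    """Format per-partition counts and compute total rows and skew ratio."""
    total = sum(counts.values())
    mean = total / len(counts)
    lines = [f"Partition {pid} has {n} records" for pid, n in sorted(counts.items())]
    skew = max(counts.values()) / mean  # 1.0 means perfectly balanced
    return lines, total, skew


# Counts as reported in this issue (on a cluster: summarize(partition_counts(df)))
lines, total, skew = summarize({0: 19000, 1: 18000, 2: 7000, 3: 4000})
print("\n".join(lines))
print(f"total={total}, skew={skew:.2f}")
```

Comparing the output before and after `df.repartition(5)` shows whether the repartitioning actually took effect on the DataFrame that reaches `fit`.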

Screenshot 2024-09-04 at 21 16 29

Code to reproduce issue

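from pyspark.sql import functions as sf  # assumed: sf is the alias for pyspark.sql.functions

max_base_date = '2024-09-01'
# Keep only rows before the cutoff date and cache both DataFrames
tmp_train_df = train_merged_df.where(sf.col('base_date') < max_base_date).cache()
tmp_actual_df = actual_merged_df.where(sf.col('base_date') < max_base_date).cache()
# model wraps the LightGBM estimator (its definition is not shown here)
model.fit(tmp_train_df, tmp_actual_df)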
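The snippet above shows a single iteration. The surrounding for-loop is not included in this report, so the following is only a sketch of what such a rolling loop might look like. It assumes monthly cutoff dates, which is an assumption (only '2024-09-01' appears above), and `monthly_cutoffs` and `rolling_fit` are hypothetical helper names.

```python
from datetime import date


def monthly_cutoffs(last, n):
    """Return n first-of-month cutoff dates ending at `last`, oldest first, as ISO strings."""
    year, month = last.year, last.month
    out = []
    for _ in range(n):
        out.append(date(year, month, 1).isoformat())
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return out[::-1]


def rolling_fit(cutoffs, fit_one):
    """Call fit_one(cutoff) for each cutoff in order and collect the results."""
    return {cutoff: fit_one(cutoff) for cutoff in cutoffs}


# Assumed schedule: three monthly cutoffs ending at the date used above
cutoffs = monthly_cutoffs(date(2024, 9, 1), 3)
print(cutoffs)
```

Here `fit_one` would wrap the filter, cache, and `model.fit` steps from the snippet above, taking the cutoff in place of `max_base_date`. Calling `unpersist()` on the cached DataFrames at the end of each iteration keeps cached data from accumulating across the loop.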

Other info / logs

No response

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

bjm88620 commented 1 month ago

Hi @dciborow, I can see that a fix PR has been created. Will the fix be available for com.microsoft.azure:synapseml_2.12:0.11.4-spark3.3? Thanks in advance.