spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark
Apache License 2.0

Spark 3 - java.lang.ArrayStoreException: java.lang.invoke.SerializedLambda #84

Open zHaytam opened 3 years ago

zHaytam commented 3 years ago

Hello,

We're trying to write a DataFrame to Redshift using Spark 3.0.1 (on EMR) and your connector, but we receive the following error:

WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 10.80.139.254, executor 1): java.lang.ArrayStoreException: java.lang.invoke.SerializedLambda
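For context, the write looks roughly like the sketch below; the format name follows the community connector's package, and the connection options and the DataFrame itself are placeholders rather than our real values.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("redshift-write").getOrCreate()
import spark.implicits._

// Placeholder DataFrame standing in for the real one.
val df = Seq(("a", 1)).toDF("col1", "col2")

df.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>") // placeholder
  .option("dbtable", "public.target_table")                                          // placeholder
  .option("tempdir", "s3a://<bucket>/<scratch-path>/")                               // placeholder
  .option("aws_iam_role", "<redshift-copy-role-arn>")                                // placeholder
  // .option("tempformat", "AVRO") // where the unload/temp format would be set
  .mode(SaveMode.Append)
  .save()
```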

Packages added:

jsleight commented 3 years ago

(I also saw your Stack Overflow post; from what I read there, you suspect it's crashing on the write to S3 with some pretty simple data.)

The write-to-S3 code is here; do you know which format you are writing with? That would help narrow down an example.

zHaytam commented 3 years ago

Hello,

We tried with Avro (the default) and CSV; both throw the same exception. I also read that part of the source code, and I suspect it might be because of either convertedRows or convertedSchema?

Thanks

jsleight commented 3 years ago

Could be, although the converters are only for decimal, date, and timestamp types -- which aren't in your example. There is something about making the schema columns lowercase, which would affect your example -- could you check whether all-lowercase column names help?
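For example, something like this minimal sketch (column names borrowed from your Stack Overflow example; the row values are made up) renames every column to lowercase before the write:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lowercase-columns").getOrCreate()
import spark.implicits._

// Example DataFrame with uppercase column names (placeholder values).
val df = Seq((1, "t", "c")).toDF("ID", "TYPE", "CODE")

// Rename every column to its lowercase form before handing it to the connector.
val lowered = df.toDF(df.columns.map(_.toLowerCase): _*)
lowered.printSchema() // id, type, code
```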

Otherwise, I'd check whether this is a case of Spark giving you a misleading error (e.g., via lazy execution, where the issue is actually somewhere else but this was the first Spark action). You could try swapping out the Redshift write for a plain S3 write to the same path, as sketched below.
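Something along these lines (a sketch; the bucket and path are placeholders, and the "avro" short name assumes the spark-avro module is on the classpath) would exercise the same S3 path without the connector:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-write-check").getOrCreate()
import spark.implicits._

// Placeholder data written straight to the scratch path instead of through the connector.
val df = Seq((1, "t", "c")).toDF("id", "type", "code")

// "avro" needs the spark-avro package on the classpath; "csv" is built in.
df.write.format("avro").save("s3a://<bucket>/<scratch-path>/avro-check/")
df.write.format("csv").save("s3a://<bucket>/<scratch-path>/csv-check/")
```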

zHaytam commented 3 years ago

All the columns in the DataFrame we're trying to write are lowercase. Also, we are able to write the DataFrame to S3 at the same path (without the conversions).

jsleight commented 3 years ago

Do you have an example df? The example you linked on Stack Overflow has columns called ["ID", "TYPE", "CODE"], which are all uppercase. If you have decimal, date, or timestamp types in your df, then a bug in the converters seems more likely.

zHaytam commented 3 years ago

The dataframe that we tried is this:

| name | id | type | count |
|------|----|------|-------|
| x    | 0  | cf   | 7     |

Nothing advanced.
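In Scala terms, roughly (a sketch, assuming a spark-shell session with spark.implicits._ in scope):

```scala
// One-row DataFrame matching the table above.
val df = Seq(("x", 0, "cf", 7)).toDF("name", "id", "type", "count")
df.show()
```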

jsleight commented 3 years ago

@88manpreet, any ideas? I don't see anything in the converters that should cause this error.

88manpreet commented 3 years ago

@zHaytam sorry, I missed getting back to you and prioritizing this earlier. Is this issue still happening?

I tried to reproduce it in the integration tests for both the Avro and CSV formats (diff: https://gist.github.com/88manpreet/8049611246ee306628dfc3e9df7eb2ad), which I think imitates the behavior above. I could see the temp files created in the scratch path for both formats.

I also didn't see anything obviously wrong with the converters. I will keep trying to reproduce it in different ways. I also noticed that the redshift-jdbc42-no-awssdk driver you are using is the same one we are using.

@zHaytam in the meantime, is it possible to test this case with the latest version, v5.0.3?
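For reference, a build.sbt sketch of pulling in that release (coordinates assumed to match the community fork's published artifacts; the %% suffix should resolve to your Scala version):

```scala
// build.sbt (sketch): the 5.0.3 release of the community connector.
libraryDependencies += "io.github.spark-redshift-community" %% "spark-redshift" % "5.0.3"
```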

Would it also be possible for you to share a patch of the relevant code you are using when you run into this scenario?