spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark

[PySpark] JSON / Avro issues #47

Closed eeshugerman closed 4 years ago

eeshugerman commented 4 years ago

I'm running into issues with writes to Redshift (reads work great!). Here's what I'm trying to execute:

(
    df.write.format(SPARK_REDSHIFT_RDN).mode('append')
    .option('url', DB_URI)
    .option('tempdir', S3_TEMP_URI)
    .option('aws_iam_role', AWS_IAM_ROLE)
    .option('dbtable', table_name)
    .option('preactions', delete_statement)
    .save()
)
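
For context, the ALL_CAPS names above are just configuration constants, roughly like this (the values here are made-up placeholders; the data source name is the one given in this project's README, if I'm reading it right):

# Hypothetical placeholder values for the constants used above
SPARK_REDSHIFT_RDN = 'io.github.spark_redshift_community.spark.redshift'  # data source name (per the README)
DB_URI = 'jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev?user=...&password=...'
S3_TEMP_URI = 's3://example-bucket/spark-redshift-temp/'
AWS_IAM_ROLE = 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
table_name = 'my_schema.my_table'
delete_statement = 'DELETE FROM my_schema.my_table WHERE ...'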

I should mention I'm a Spark/Scala noob. I built my spark-redshift jar by following tutorial/how_to_build.md, and am passing it to spark-submit with the --jars flag.

First I got

py4j.protocol.Py4JJavaError: An error occurred while calling o589.save.
: java.lang.NoClassDefFoundError: com/eclipsesource/json/Json
        at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:147)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)

[truncated]

I found an issue on the old Databricks repo where someone suggested adding --packages com.eclipsesource.minimal-json:minimal-json:0.9.4.
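
(For what it's worth, I believe the same Maven coordinate can also be set from the SparkSession builder instead of the spark-submit command line; a rough sketch, assuming the session hasn't been created yet:)

from pyspark.sql import SparkSession

# Sketch: pull minimal-json from Maven via spark.jars.packages rather than
# the --packages flag. This only takes effect if no SparkContext exists yet.
spark = (
    SparkSession.builder
    .config('spark.jars.packages', 'com.eclipsesource.minimal-json:minimal-json:0.9.4')
    .getOrCreate()
)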

With this addition, I get a new error:

py4j.protocol.Py4JJavaError: An error occurred while calling o589.save.
: org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:647)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:245)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
        at io.github.spark_redshift_community.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:296)

[truncated]

Any ideas? Thanks!

smoy commented 4 years ago

I don't exercise the write path; I'm basing this on some Spark documentation. (Note to self: add back a writer integration test. We may have deleted it in our hurry to port the original code to Spark 2.4.)

https://spark.apache.org/docs/latest/sql-data-sources-avro.html

According to the deployment section, if you are using Spark 2.4 (this is important, since this community version only supports 2.4+ at the moment), you have to add the Avro dependency yourself.

Since you are already using the --jars method to handle the JSON dependency, you can grab the appropriate jar from https://mvnrepository.com/artifact/org.apache.spark/spark-avro

Make sure you grab the artifact built for your Scala version.
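
For example, the coordinate follows the pattern org.apache.spark:spark-avro_<scala version>:<spark version>; a quick sketch to derive it for your cluster (the specific versions shown are just examples):

# Hypothetical helper: build the spark-avro Maven coordinate matching the cluster.
import pyspark

scala_version = '2.11'                # prebuilt Spark 2.4.x typically ships with Scala 2.11
spark_version = pyspark.__version__   # e.g. '2.4.4'
spark_avro_package = 'org.apache.spark:spark-avro_{}:{}'.format(scala_version, spark_version)
# e.g. 'org.apache.spark:spark-avro_2.11:2.4.4' -- pass this via --packages
# (or spark.jars.packages) alongside the minimal-json coordinate.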

eeshugerman commented 4 years ago

Thanks for your response!

I ended up getting it working like so:

spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar \
  --packages org.apache.spark:spark-avro_2.11:2.4.2,io.github.spark-redshift-community:spark-redshift_2.11:4.0.0 \
  my_script.py

(My coworker pointed out that spark-redshift-community builds are now available on Maven, so there's no need to build the jar ourselves; that's why I'm no longer passing it in with --jars.)

I'm confused as to why it's necessary to specify the Avro dependency... why isn't it handled automatically like all the rest? I see it's included in build.sbt.

eeshugerman commented 4 years ago

PR to add this to the README: https://github.com/spark-redshift-community/spark-redshift/pull/49