Closed: eeshugerman closed this issue 4 years ago
I don't exercise the write path; I'm basing this on the Spark documentation. (Note to self: add back a writer integration test. We may have deleted it in our hurry to port the original code to Spark 2.4.)
https://spark.apache.org/docs/latest/sql-data-sources-avro.html
According to the deployment section, if you are using Spark 2.4 (important, since this community version only supports 2.4+ at the moment), you have to add the Avro dependency yourself.
Since you are already using the --jars method to handle the JSON dependency, you can grab the appropriate jar from https://mvnrepository.com/artifact/org.apache.spark/spark-avro
Make sure you grab the artifact built for the appropriate Scala version.
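(If managing the jar by hand is awkward, the same artifact can also be pulled at session creation time via the spark.jars.packages config, which behaves like spark-submit --packages. A minimal sketch, assuming Spark 2.4.x on Scala 2.11 and that the config is set before any SparkContext exists; adjust the coordinates to your cluster.)

from pyspark.sql import SparkSession

# Sketch only: spark.jars.packages is equivalent to spark-submit --packages.
# Coordinates below assume Spark 2.4.x built for Scala 2.11.
spark = (
    SparkSession.builder
    .appName("avro-dependency-sketch")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.2")
    .getOrCreate()
)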
Thanks for your response!
I ended up getting it working like so:
spark-submit \
--deploy-mode cluster \
--master yarn \
--jars /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar \
--packages org.apache.spark:spark-avro_2.11:2.4.2,io.github.spark-redshift-community:spark-redshift_2.11:4.0.0 \
my_script.py
(My coworker pointed out that spark-redshift community edition builds are available on Maven now, so there's no need to build it ourselves; that's why I'm no longer passing it in with --jars.)
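For reference, the kind of write that needs spark-avro under the hood looks roughly like this. The format string and option names follow the community README; the JDBC URL, table, bucket, and IAM role below are hypothetical placeholders, not my actual script.

from pyspark.sql import SparkSession

# Illustrative write-path sketch; connection details are placeholders.
spark = SparkSession.builder.appName("redshift-write-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

df.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", "jdbc:redshift://example-host:5439/dev?user=user&password=pass") \
    .option("dbtable", "example_table") \
    .option("tempdir", "s3://example-bucket/tmp/") \
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/example-redshift-role") \
    .mode("error") \
    .save()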
I'm confused as to why it's necessary to specify the Avro dependency... why isn't it handled automatically like all the rest? I see it's included in build.sbt.
PR to add this to the README: https://github.com/spark-redshift-community/spark-redshift/pull/49
I'm running into issues with writes to Redshift. (Reads work great!) Here's what I'm trying to execute:
I should mention I'm a Spark/Scala noob. I built my spark-redshift jar by following tutorial/how_to_build.md, and am passing it to spark-submit with the --jars flag. First I got:
I found an issue on the old Databricks repo where someone suggested adding --packages com.eclipsesource.minimal-json:minimal-json:0.9.4. With this addition, I get a new error:
Any ideas? Thanks!