Hi @conker84, I'm getting exactly the same error log:
java.lang.IllegalArgumentException: Wrong FS s3://<bucket-name>//athena_query/64902755-120d-4f9f-8cb6-660b4c1099fb/manifest.json -expected s3a://<bucket-name>
at org.apache.hadoop.fs.s3native.S3xLoginHelper.checkPath(S3xLoginHelper.java:224)
at org.apache.hadoop.fs.s3a.S3AFileSystem.checkPath(S3AFileSystem.java:1155)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:666)
at org.apache.hadoop.fs.s3a.S3AFileSystem.makeQualified(S3AFileSystem.java:1117)
at org.apache.hadoop.fs.s3a.S3AFileSystem.qualify(S3AFileSystem.java:1143)
at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:1318)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1064)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1052)
at io.github.spark_redshift_community.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:330)
...
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
with the following specs:
Spark v3.2.0
SBT dependencies:
lazy val sparkDeps = Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-mllib" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-avro" % sparkVersion,
  "org.apache.hadoop" % "hadoop-aws" % "3.2.0" % Provided,
  "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.962" % Provided,
  "com.amazon.redshift" % "redshift-jdbc42" % "1.2.10.1009",
  "io.github.spark-redshift-community" %% "spark-redshift" % "5.0.4",
)

dependencyOverrides ++= Seq(
  "org.apache.hadoop" % "hadoop-aws" % "3.2.0"
)
when I call the following (writing a DataFrame as a Redshift table):
df.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", s"jdbc:redshift://$redshiftHost:$port/$database?user=$username&password=$password")
  .option("dbtable", table)
  .option("tempdir", s3TempDirPath)
  .option(if (iamRole != null) "aws_iam_role" else "forward_spark_s3_credentials",
          if (iamRole != null) iamRole else "true")
  .option("tempformat", "CSV GZIP")
  .option("extracopyoptions", "DELIMITER ','")
  .option("csvnullstring", "")
  .mode(saveMode)
  .save()
Moreover, I should mention that:
- I never faced this issue using Spark v3.0.2 (or lower);
- `s3TempDirPath` is in the form s3a://<bucket-name>/athena_query/.
I use the following credentials to set up Hadoop:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY_ID)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_ACCESS_KEY)
In my view, `sanitizedTempDir` could still be used, since `Utils.removeCredentialsFromURI()` only strips the credentials from the URI, but I don't understand why `Utils.fixS3Url()`, which simply does

```scala
url.replaceAll("s3[an]://", "s3://")
```

has to be used here, given that the `s3a` protocol should be required (also because of the S3A Hadoop configs).
Here is the change I propose to fix the bug.
Feel free to discuss or propose another solution. Thanks 👍
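For illustration only (this is not the actual patch in the PR, whose diff isn't shown here), a minimal sketch of the idea being discussed: build the manifest path from a temp-dir URI that keeps its original `s3a` scheme instead of rewriting it to `s3://`. The helper names `removeCredentialsFromURI` and `manifestPathFor` are hypothetical; only `Utils.removeCredentialsFromURI()` and the scheme mismatch come from this thread.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ManifestPathSketch {

  // Strip embedded credentials but keep the original scheme (s3a), so the
  // FileSystem resolved for the temp dir and the manifest path agree on the scheme.
  // Hypothetical stand-in for Utils.removeCredentialsFromURI().
  private def removeCredentialsFromURI(uri: URI): URI =
    new URI(uri.getScheme, null, uri.getHost, uri.getPort, uri.getPath, uri.getQuery, uri.getFragment)

  // Hypothetical helper: qualify the manifest path against the temp dir's own filesystem.
  def manifestPathFor(tempDir: String, hadoopConf: Configuration): Path = {
    val sanitized = removeCredentialsFromURI(new URI(tempDir)) // e.g. s3a://<bucket-name>/athena_query/
    val fs = FileSystem.get(sanitized, hadoopConf)             // an S3AFileSystem bound to scheme s3a
    val manifestPath = new Path(new Path(sanitized), "manifest.json")
    fs.makeQualified(manifestPath)                             // schemes match, so no "Wrong FS"
  }
}
```

Because the manifest path keeps the same scheme as the filesystem it is written through, `S3AFileSystem.checkPath` has nothing to complain about.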
@jsleight @lucagiovagnoli @88manpreet can you please support us on it? Many thanks!
There is already a PR waiting for review, @gios91.
I just merged #110 which should fix this.
Hi all and @conker84,
I am also getting the same error as above when writing to Redshift from PySpark.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")
sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
......
block_df.write.format("io.github.spark_redshift_community.spark.redshift"). \
    option("url", f"jdbc:redshift://{secrets['host']}:{secrets['port']}"
                  f"/{secrets['db']}?user={secrets['username']}&password={secrets['password']}"). \
    option("dbtable", table). \
    option("tempdir", f's3a://{bucket}/{prefix}/tmp/'). \
    option("aws_iam_role", secrets['iam_role']). \
    mode("overwrite").save()
and I get the following:
Traceback (most recent call last):
File "/home/ubuntu/spark_jobs/project/and/scripts/{script_name}.py", line 228, in <module>
block_df.write.format("io.github.spark_redshift_community.spark.redshift"). \
File "/opt/spark/spark-3.3.1-bin-hadoop3/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 966, in save
File "/opt/spark/spark-3.3.1-bin-hadoop3/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/opt/spark/spark-3.3.1-bin-hadoop3/python/lib/pyspark.zip/pyspark/sql/utils.py", line 196, in deco
pyspark.sql.utils.IllegalArgumentException: Wrong FS s3://{bucket}//{prefix}/tmp/5b519ea1-00c0-48b0-98a7-64e10aa0a97d/manifest.json -expected s3a://{bucket}
I am using PySpark version 3.3.1 and Scala version 2.12.15.
I downloaded the Spark-Redshift jar from the following: https://mvnrepository.com/artifact/io.github.spark-redshift-community/spark-redshift_2.12/5.1.0 (last updated Sep 30, 2022), which downloads spark-redshift_2.12-5.1.0-amzn-0.jar, and I have it located in my SPARK_HOME.
Based on the date of the PR associated with this issue (#110) and the last-updated date of the Maven repo, I assume the changes were included in the jars, but I'm not 100% sure. I attempted to read the RedshiftWriter.class file using javap but was not able to discern whether the updates were included.
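As a side note, one quick way to sanity-check which build a local jar actually is, before resorting to decompilation, is to read its manifest and confirm the packaged class. This is only a minimal sketch under assumptions: the jar path below is made up, and the manifest attributes may or may not be populated for a given build.

```scala
import java.util.jar.JarFile

object InspectRedshiftJar {
  def main(args: Array[String]): Unit = {
    // Hypothetical local path; point this at whatever jar is actually on your classpath.
    val jar = new JarFile("/opt/spark/jars/spark-redshift_2.12-5.1.0-amzn-0.jar")
    try {
      // Print the main manifest attributes, if the build populated them.
      Option(jar.getManifest).foreach { m =>
        val attrs = m.getMainAttributes
        println(s"Implementation-Title:   ${attrs.getValue("Implementation-Title")}")
        println(s"Implementation-Version: ${attrs.getValue("Implementation-Version")}")
      }
      // Confirm the writer class discussed in this issue is actually packaged in the jar.
      val entry = "io/github/spark_redshift_community/spark/redshift/RedshiftWriter.class"
      println(s"contains RedshiftWriter: ${jar.getEntry(entry) != null}")
    } finally jar.close()
  }
}
```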
Let me know if you need more information. Any insight on the above would be much appreciated.
Thanks in advance! :)
Hmmm, our changelog indicates the fix should be included in v5.1.0.
@JeremyBrent your traceback is slightly different from this issue's, though. I.e., your error is coming from Python whereas the original error is coming from the JVM. I'm guessing maybe Python intercepted the JVM exception and reformatted it?
Without the JVM stack trace it is hard to tell what is going on, unfortunately.
Hi @jsleight, thanks for getting back to me. Yeah, I cannot seem to find the JVM exception in any of the logs.
The confusing part is that the job finishes with a success status (below comes from the spark events log file):
{"Event":"SparkListenerJobEnd","Job ID":6,"Completion Time":1677096372200,"Job Result":{"Result":"JobSucceeded"}}
The Spark History server shows that all executors completed their tasks successfully; all stages were marked as successful; all jobs were marked as successful; etc.
There's no sign that this job failed (though it did) except for the small PySpark stack trace shown above and the lack of output data in Redshift.
Below is a stack trace from the `save at RedshiftWriter.scala:301` stage, but there are no exceptions here (not sure if it'll provide any insight):
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
io.github.spark_redshift_community.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:301)
io.github.spark_redshift_community.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:403)
io.github.spark_redshift_community.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:115)
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
Figured out my issue ....
I was incorrect in my initial assumption ("I downloaded the Spark-Redshift jar from the following, https://mvnrepository.com/artifact/io.github.spark-redshift-community/spark-redshift_2.12/5.1.0 (last updated Sep 30, 2022)")
We were using spark-redshift_2.12-5.1.0-amzn-0.jar, which we obtained from an AWS EMR cluster, but needed to be using spark-redshift_2.12-5.1.0.jar from the link above. Once I updated the jar, it worked fine.
Putting this here in case anyone runs into this issue in the future.
Thanks for reporting back! cc @Brooke-white @kunaldeepsingh seems like there are a few divergences between the open-source and Amazon jars.
Hi everybody! I have a simple Spark job that gets the data from Neo4j and tries to ingest it into Redshift:
But I get the following error:
Looking at the code it seems that the problem is related to these lines:
https://github.com/spark-redshift-community/spark-redshift/blob/1f12470735c17022ae2dd68d0963649577928dde/src/main/scala/io/github/spark_redshift_community/spark/redshift/RedshiftWriter.scala#L313-L332
At L313, `fs` is defined on the `s3a` protocol, whereas `manifestPath` at L329 is defined on a sanitized version that relies on the `s3` protocol, causing my error. Is that correct?