Open shawnweeks opened 4 years ago
Currently this doesn't work on GovCloud because we're calling the S3 client without passing in a region. I'm working on a patch; I just wanted to track the issue here. I'll update once I track down the exact cause.
Specifically, the default S3 client constructor doesn't work against GovCloud; you have to use the builder and specify the region. Currently Scala is torturing me, though.
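As a rough illustration of why the missing region breaks GovCloud (a sketch in Java, the SDK's host language, not the library's code): the default S3 endpoint serves only the standard AWS partition, while GovCloud buckets live under region-specific endpoints, so a client constructed without a region resolves the wrong host. The helper below is hypothetical and only models that endpoint choice.

```java
public class S3EndpointSketch {
    // Hypothetical helper: map a region name to its S3 endpoint.
    // With no region we fall back to the global default, which does
    // not serve the GovCloud partition -- the failure described above.
    static String s3Endpoint(String region) {
        if (region == null || region.isEmpty()) {
            return "s3.amazonaws.com"; // standard partition only
        }
        return "s3." + region + ".amazonaws.com";
    }

    public static void main(String[] args) {
        System.out.println(s3Endpoint(null));
        System.out.println(s3Endpoint("us-gov-west-1"));
    }
}
```

With the real AWS SDK v1, the equivalent fix is to construct the client via `AmazonS3ClientBuilder.standard().withRegion(...).build()` instead of the deprecated `new AmazonS3Client()` constructor.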
Hi @shawnweeks, I'm sorry, could you provide some more pointers please?
- A link to the code where we are hardcoding s3.amazonaws.com
- A stacktrace of the error that you are getting
In `io.github.spark_redshift_community.spark.redshift.Util`, the method `def addEndpointToUrl(url: String, domain: String = "s3.amazonaws.com"): String` hardcodes the default domain. It would be better if it instead used the Hadoop config property `fs.s3a.endpoint`.
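A minimal sketch of that suggestion (Java for illustration; `java.util.Map` stands in for Hadoop's `Configuration`, and the URL rewrite only mirrors the general shape of `addEndpointToUrl`, not its actual body):

```java
import java.util.Map;

public class EndpointResolverSketch {
    // Prefer the configured fs.s3a.endpoint over the hardcoded
    // default domain; fall back only when nothing is set.
    static String resolveDomain(Map<String, String> hadoopConf) {
        return hadoopConf.getOrDefault("fs.s3a.endpoint", "s3.amazonaws.com");
    }

    // Rewrite s3://bucket/key to s3://bucket.<domain>/key, the kind
    // of transformation addEndpointToUrl performs (shape assumed).
    static String addEndpointToUrl(String url, String domain) {
        return url.replaceFirst("^(s3[an]?)://([^/]+)/", "$1://$2." + domain + "/");
    }

    public static void main(String[] args) {
        Map<String, String> conf =
            Map.of("fs.s3a.endpoint", "s3.us-gov-west-1.amazonaws.com");
        System.out.println(addEndpointToUrl("s3n://my-bucket/tmp/unload",
                                            resolveDomain(conf)));
    }
}
```

Reading the domain from the Hadoop configuration means GovCloud (or any non-default endpoint) works without code changes, since `fs.s3a.endpoint` is already how S3A users point Hadoop at alternate S3 endpoints.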
WARN - Utils$ - An error occurred while trying to determine the S3 bucket's region

```
com.amazonaws.AmazonClientException: Signature Version 4 requires knowing the region of the bucket you're trying to access. You can configure a region by calling AmazonS3Client.setRegion(Region) or AmazonS3Client.setEndpoint(String) with a region-specific endpoint such as "s3-us-west-2.amazonaws.com".
	at com.amazonaws.services.s3.AmazonS3Client.createSigner(AmazonS3Client.java:2958)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3526)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
	at com.amazonaws.services.s3.AmazonS3Client.getBucketLocation(AmazonS3Client.java:678)
	at com.amazonaws.services.s3.AmazonS3Client.getBucketLocation(AmazonS3Client.java:686)
	at io.github.spark_redshift_community.spark.redshift.Utils$.getRegionForS3Bucket(Utils.scala:182)
	at io.github.spark_redshift_community.spark.redshift.RedshiftWriter$$anonfun$saveToRedshift$1.apply(RedshiftWriter.scala:364)
	at io.github.spark_redshift_community.spark.redshift.RedshiftWriter$$anonfun$saveToRedshift$1.apply(RedshiftWriter.scala:363)
	at scala.Option.foreach(Option.scala:257)
	at io.github.spark_redshift_community.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:363)
	at io.github.spark_redshift_community.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:109)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
```
It's been almost a year since I looked at this, and we've decided to just build out the capability ourselves. Short answer: this library doesn't work on GovCloud because it uses an older AWS SDK client constructor that is no longer supported and doesn't correctly resolve the endpoint for GovCloud.
Currently we are hardcoding s3.amazonaws.com, which breaks connections to anything with a different endpoint URL, like GovCloud.
Updated description as I understand the issue better. See my comments below.