spark-redshift-community/spark-redshift

Performant Redshift data source for Apache Spark
Apache License 2.0

Support Gov Cloud S3 Endpoints #53

Open shawnweeks opened 4 years ago

shawnweeks commented 4 years ago

Currently we are hard-coding s3.amazonaws.com, which breaks connections to anything with a different URL, like Gov Cloud.

Updated description as I understand the issue better. See my comments below.

lucagiovagnoli commented 4 years ago

Hi @shawnweeks, I'm sorry, could you provide some more pointers, please?

shawnweeks commented 4 years ago

Currently this doesn't work on Gov Cloud because we're calling the S3 client without passing in a region. I'm working on a patch; I just wanted to track the issue here. I'll update once I track down the exact cause.

shawnweeks commented 4 years ago

Specifically, the default S3 client constructor doesn't work against Gov Cloud. You have to use the builder and specify the region. Currently Scala is torturing me, though.
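
Roughly, the difference looks like this (AWS SDK for Java v1; a sketch, not this library's code, and the region string is just an example):

```scala
import com.amazonaws.services.s3.{AmazonS3, AmazonS3Client, AmazonS3ClientBuilder}

object S3ClientSketch {
  // Deprecated no-arg constructor: the client defaults to the global
  // s3.amazonaws.com endpoint, which cannot reach GovCloud buckets.
  val legacyClient: AmazonS3 = new AmazonS3Client()

  // Builder with an explicit region: requests are signed for and sent to
  // the region-specific endpoint (here the GovCloud West region).
  val govCloudClient: AmazonS3 = AmazonS3ClientBuilder.standard()
    .withRegion("us-gov-west-1")
    .build()
}
```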

arahman commented 3 years ago

Hi @shawnweeks, I'm sorry, could you provide some more pointers, please?

  • A link to the code where we are hardcoding s3.amazonaws.com
  • A stacktrace of the error that you are getting

In io.github.spark_redshift_community.spark.redshift.Utils, the method def addEndpointToUrl(url: String, domain: String = "s3.amazonaws.com"): String hard-codes the default domain. It would be better if it used the Hadoop config property fs.s3a.endpoint instead.
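
For illustration, something along these lines (a sketch only: it assumes the method keeps its current host-rewriting behavior, and the Configuration parameter is a hypothetical addition):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration

// Sketch: resolve the endpoint from the Hadoop property fs.s3a.endpoint,
// falling back to the global default, instead of hard-coding
// s3.amazonaws.com in the method signature.
def addEndpointToUrl(url: String, hadoopConf: Configuration): String = {
  val domain = hadoopConf.get("fs.s3a.endpoint", "s3.amazonaws.com")
  val uri = new URI(url)
  // Append the endpoint domain to the bucket host, preserving the rest
  // of the URI unchanged.
  val hostWithEndpoint = uri.getHost + "." + domain
  new URI(uri.getScheme, uri.getUserInfo, hostWithEndpoint, uri.getPort,
    uri.getPath, uri.getQuery, uri.getFragment).toString
}
```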

WARN - Utils$ - An error occurred while trying to determine the S3 bucket's region
com.amazonaws.AmazonClientException: Signature Version 4 requires knowing the region of the bucket you're trying to access. You can configure a region by calling AmazonS3Client.setRegion(Region) or AmazonS3Client.setEndpoint(String) with a region-specific endpoint such as "s3-us-west-2.amazonaws.com".
    at com.amazonaws.services.s3.AmazonS3Client.createSigner(AmazonS3Client.java:2958)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3526)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
    at com.amazonaws.services.s3.AmazonS3Client.getBucketLocation(AmazonS3Client.java:678)
    at com.amazonaws.services.s3.AmazonS3Client.getBucketLocation(AmazonS3Client.java:686)
    at io.github.spark_redshift_community.spark.redshift.Utils$.getRegionForS3Bucket(Utils.scala:182)
    at io.github.spark_redshift_community.spark.redshift.RedshiftWriter$$anonfun$saveToRedshift$1.apply(RedshiftWriter.scala:364)
    at io.github.spark_redshift_community.spark.redshift.RedshiftWriter$$anonfun$saveToRedshift$1.apply(RedshiftWriter.scala:363)
    at scala.Option.foreach(Option.scala:257)
    at io.github.spark_redshift_community.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:363)
    at io.github.spark_redshift_community.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:109)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)

shawnweeks commented 3 years ago

It's been almost a year since I looked at this, and we've decided to just build out the capability ourselves. Short answer: this library doesn't work on Gov Cloud because it constructs its S3 client with an older, deprecated AWS SDK constructor and so doesn't correctly resolve the endpoint for Gov Cloud.
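
For anyone else who lands here, a minimal sketch of the region-aware, builder-based client the SDK expects (the helper name and endpoint values are illustrative, not part of this library):

```scala
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}

// Hypothetical helper: build the client against the partition's own S3
// endpoint so calls like getBucketLocation are signed for the right region.
def s3ClientFor(endpoint: String, signingRegion: String): AmazonS3 =
  AmazonS3ClientBuilder.standard()
    .withEndpointConfiguration(new EndpointConfiguration(endpoint, signingRegion))
    .build()

// Example usage against GovCloud West:
// s3ClientFor("s3.us-gov-west-1.amazonaws.com", "us-gov-west-1")
//   .getBucketLocation("some-bucket")
```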