nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

S3 access: us-east-1 versus us-east-2 #270

Closed: datawookie closed this issue 5 years ago

datawookie commented 5 years ago

Hi Nicholas!

This is not so much an issue as a question/observation/suggestion.

I have been hitting my head against the problem of accessing S3 files from a Spark cluster launched with flintrock.

This is my setup: I've launched a Spark 2.3.1 cluster and added both hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar to the spark/jars folder on the master and all slaves.

I launch PySpark using:

pyspark --master spark://172.___.___.122:7077
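
(As an aside: I believe the same jars could be pulled in at launch with --packages instead of being copied by hand. Something like the following should fetch hadoop-aws from Maven along with its matching aws-java-sdk dependency, though I haven't verified it on this cluster:)

pyspark --master spark://172.___.___.122:7077 --packages org.apache.hadoop:hadoop-aws:2.7.3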

What Doesn't Work

I have a bucket on us-east-2, which I try to access as follows:

rdd = sc.textFile("s3a://bucket-on-us-east-2/gutenberg_sub/3/5/6/7/35678/35678.txt")
rdd.collect()

This initially looks like it is going to work.

[Stage 0:>                                                          (0 + 1) / 2]

But then I get a flurry of errors with Status Code: 400 and AWS Error Message: Bad Request.

What Does Work

I have a copy of the same bucket on us-east-1.

rdd = sc.textFile("s3a://bucket-on-us-east-1/gutenberg_sub/3/5/6/7/35678/35678.txt")
rdd.collect()

That works flawlessly.

I've done a lot of reading to try to resolve/understand this issue. I got the idea that some extra settings were required for us-east-2, so I tried again using the following:

# Ask the AWS SDK to use Version 4 request signing.
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

# Point s3a at the us-east-2 regional endpoint.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")

I also tried a bunch of other fiddly options that I read about on Stack Overflow, but none of them sorted the problem.

After hitting my head against this for quite some time, I am just going to use buckets on us-east-1 in future, since they seem to work fine.

However, this leaves me with some questions:

  1. What's the difference between us-east-1 and us-east-2 that accounts for this behaviour? I've read that us-east-2 only supports version 4 signatures while us-east-1 supports both versions 2 and 4. Is this at the core of the issue?
  2. Would it make sense to mention this in the flintrock README so that other people don't waste a lot of time with this particular issue?

Again, thanks for an awesome bit of software: flintrock is making my life so much easier! It's just AWS that's causing headaches!

Best regards, Andrew.

nchammas commented 5 years ago

Different regions of AWS sometimes support different features or require different options to be set, like in this case. So yes, the main difference is that one region requires a newer signature version whereas the other doesn’t.
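
For reference, the same constraint shows up outside of Spark. Here is a minimal sketch with boto3 (assuming boto3 is installed and credentials are configured; in botocore, "s3" is the old V2 signature version and "s3v4" the new one):

import boto3
from botocore.client import Config

# Forcing V2 signatures ("s3") against us-east-2 yields 400 Bad Request,
# while V4 ("s3v4") works; us-east-1 accepts both versions.
s3_v4 = boto3.client("s3", region_name="us-east-2",
                     config=Config(signature_version="s3v4"))
s3_v4.head_object(Bucket="bucket-on-us-east-2",
                  Key="gutenberg_sub/3/5/6/7/35678/35678.txt")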

There’s no way for Flintrock to account for all the intricacies of AWS, either in code or via documentation, so for these kinds of things I think it’s best to leave the problem to other resources, like Stack Overflow, the Spark docs, or the AWS docs. There are also some guides linked from the README that may include information like this.

datawookie commented 5 years ago

Agreed: you can't expect Flintrock to cater for or document all of the nooks and crannies of AWS.

However, in the interests of saving users time and frustration, I think it might be worth mentioning in the README that S3 access works well in regions that support Version 2 signatures but can be problematic in regions that only accept Version 4 signatures. A rough suggestion follows below.
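
Something along these lines might do, for example (rough wording; the exact version caveats would need checking):

    Note: buckets in newer AWS regions (e.g. us-east-2) accept only Version 4
    request signatures. With Hadoop 2.7.x / aws-java-sdk 1.7.x you may need to
    enable V4 signing (-Dcom.amazonaws.services.s3.enableV4=true on both the
    driver and the executors) and set fs.s3a.endpoint to the regional endpoint,
    or use a region that also accepts Version 2 signatures (e.g. us-east-1).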

Thanks again for your feedback.