Closed. qwertystop closed this issue 6 years ago.
Some debugging questions:

- If you `flintrock login` to the cluster and run `aws s3 ls s3://qtest11/` from there, what do you get?
- Is `hadoop-aws` mentioned when you start up the Spark shell?
- Does `spark/conf/spark-defaults.conf` exist on the cluster, and if so can you paste the contents here?

Looks fine:
$ aws s3 ls s3://qtest11/
2018-03-16 19:47:52 382 realestate.txt
`hadoop-aws` appears twice, near the top:
$ spark-shell
Ivy Default Cache set to: /home/ec2-user/.ivy2/cache
The jars for the packages stored in: /home/ec2-user/.ivy2/jars
:: loading settings :: url = jar:file:/home/ec2-user/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found org.apache.hadoop#hadoop-aws;2.7.3 in central
That conf file exists; it has one line:
spark.jars.packages org.apache.hadoop:hadoop-aws:2.7.3
Looks like us-east-2 only supports V4 signatures, in which case this answer should help.
And if it does help, then I'm not sure there is anything I can do here. I suppose I could add a warning to the README about this, but there must be so many little gotchas like this I'm not sure it makes sense to try to address them here.
Yeah, doesn't really look like it's on you to fix. Thanks for finding it anyway. Though that answer isn't working – but again, not on you.
But let's get to the bottom of the issue here regardless. Can you list all the settings you've added? It looks like you need to set `fs.s3a.endpoint` and `com.amazonaws.services.s3.enableV4` at the very least. How are you setting the latter?
scala> System.setProperty("com.amazonaws.services.s3.enableV4", "true")
scala> System.setProperty("fs.s3a.endpoint", "s3-us-east-2.amazonaws.com")
as suggested by that SO post.
I think `fs.s3a.endpoint` needs to be set via `hadoopConfiguration`. Does that change anything?
Yep, that makes it work. Thanks for helping a novice through things.
OK great! Glad we found a solution.
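For reference, the combination that worked in this thread can be sketched as below. This is a sketch, not verified against a live cluster; the bucket and file names are the ones used earlier in the thread, and the endpoint string is the one from the linked SO answer:

```scala
// Run in spark-shell on the cluster, before the first S3A read.
// V4 signing is enabled as a JVM system property, but the S3A endpoint
// must go into the Hadoop configuration, not a system property.
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-us-east-2.amazonaws.com")

// With both in place, reads from the us-east-2 bucket went through:
val lines = sc.textFile("s3a://qtest11/realestate.txt")
lines.count()
```

The ordering matters: both settings have to be in place before Spark first instantiates the S3A filesystem client, since the client caches its endpoint and signing configuration when it is created.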
According to the README, to get the cluster to access data on S3 through s3a:
I tried this, and it isn't working; either I'm overlooking something, the documentation skipped a step, or something's wrong in the default configuration. Possibly several of those.
My Flintrock configuration is entirely default (as produced by `flintrock configure`), except for filling in key-name, identity-file, and instance-profile-name, and changing the region to `us-east-2`, the AMI to `ami-25615740` (the us-east-2 equivalent of the default), and the instance-type to `t2.micro` (free). But the versions of Spark and HDFS are default (2.2.0 and 2.7.3, from the default download sources). I've reinitialized it (and then made the above specific changes, and no others) to be sure of that.

I have an IAM role called S3_accessor, with one policy, AmazonS3FullAccess. That's given as the instance-profile-name in Flintrock's configuration.
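Under those assumptions, the changed portion of the `flintrock configure` YAML would look roughly like this (a sketch only; the key-name and identity-file values are hypothetical placeholders, and the key names follow Flintrock's config template):

```yaml
# Sketch of the non-default portion of Flintrock's config.yaml.
providers:
  ec2:
    key-name: my-key                     # filled in (value assumed)
    identity-file: /path/to/my-key.pem   # filled in (value assumed)
    instance-profile-name: S3_accessor   # IAM role with AmazonS3FullAccess
    instance-type: t2.micro              # free tier
    region: us-east-2
    ami: ami-25615740                    # us-east-2 equivalent of the default AMI
```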
Then I run `flintrock launch test-cluster`. No errors are reported. Next, `flintrock login test-cluster`. There is one message about available updates through `yum`, which I do not apply at this time (for the sake of reproducibility). Then, `spark-shell`. Dependency resolution produces no errors. Here's the console after it's done, through to the stack trace when I try to read from S3:

I've confirmed that the file exists; it's a little text file I had lying around. I uploaded it with the web interface and downloaded it back with the CLI, all with the same auth as I gave to Flintrock.
Looking it up, the only suggestion I found was to run `sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-us-east-2.amazonaws.com")` before trying to load the text file. This had no effect (identical stack trace).