nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

README's instructions for s3a seem incomplete or inaccurate #240

Closed: qwertystop closed this issue 6 years ago

qwertystop commented 6 years ago

According to the README, to get the cluster to access data on S3 through s3a:

We recommend you access data on S3 from your Flintrock cluster by following these steps:

  1. Setup an IAM Role that grants access to S3 as desired. Reference this role when you launch your cluster using the --ec2-instance-profile-name option (or its equivalent in your config.yaml file).
  2. Reference S3 paths in your Spark code using the s3a:// prefix. s3a:// is backwards compatible with s3n:// and replaces both s3n:// and s3://. The Hadoop project recommends using s3a:// since it is actively developed, supports larger files, and offers better performance.
  3. Make sure Flintrock is configured to use Hadoop/HDFS 2.7+. Earlier versions of Hadoop do not have solid implementations of s3a://. Flintrock's default is Hadoop 2.7.4, so you don't need to do anything here if you're using a vanilla configuration.

With this approach you don't need to copy around your AWS credentials or pass them into your Spark programs. As long as the assigned IAM role allows it, Spark will be able to read and write data to S3 simply by referencing the appropriate path (e.g. s3a://bucket/path/to/file).

I tried this, and it isn't working; either I'm overlooking something, the documentation skipped a step, or something's wrong in the default configuration. Possibly several of those.

My Flintrock configuration is entirely default (as produced by flintrock configure), except that I filled in key-name, identity-file, and instance-profile-name, and changed the region to us-east-2, the AMI to ami-25615740 (the us-east-2 equivalent of the default), and the instance-type to t2.micro (free tier). The Spark and HDFS versions are the defaults (2.2.0 and 2.7.3, from the default download sources); I reinitialized the config and reapplied only the changes above to be sure of that.
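
The relevant portion of the config looks roughly like this (a sketch following flintrock configure's template; the key name and identity-file path are placeholders, and the profile name is the IAM role described below):

services:
  spark:
    version: 2.2.0
  hdfs:
    version: 2.7.3

providers:
  ec2:
    key-name: <key-name>
    identity-file: /path/to/<key-name>.pem
    instance-profile-name: <IAM role name, see below>
    instance-type: t2.micro
    region: us-east-2
    ami: ami-25615740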

I have an IAM role called S3_accessor, with one policy, AmazonS3FullAccess. That's given as the instance-profile-name in Flintrock's configuration.

Then I run flintrock launch test-cluster; no errors are reported. After flintrock login test-cluster, there's one message about available updates through yum, which I don't apply (for the sake of reproducibility).

Then I start spark-shell. Dependency resolution produces no errors. Here's the console output after it finishes, through to the stack trace when I try to read from S3:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/03/16 23:25:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/16 23:25:10 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://ec2-18-218-79-104.us-east-2.compute.amazonaws.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1521242701346).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val textFile = sc.textFile("s3a://qtest11/realestate.txt")
textFile: org.apache.spark.rdd.RDD[String] = s3a://qtest11/realestate.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> textFile.count()
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 1A7F70463AA88DFE, AWS Error Code: null, AWS Error Message: Bad Request
  at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
  at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
  at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
  at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
  ... 48 elided

scala> 

I've confirmed that the file exists: it's a little text file I had lying around; I uploaded it with the web interface and downloaded it back with the CLI, all with the same credentials I gave to Flintrock.

Looking it up, the only suggestion I found was to run sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-us-east-2.amazonaws.com") before trying to load the text file. This had no effect (identical stack trace).

nchammas commented 6 years ago

Some debugging questions:

  1. If you flintrock login to the cluster and run aws s3 ls s3://qtest11/ from there, what do you get?
  2. Do you see hadoop-aws mentioned when you start up the Spark shell?
  3. Can you confirm that spark/conf/spark-defaults.conf on the cluster exists, and if so can you paste the contents here?
qwertystop commented 6 years ago
  1. Looks fine.

    $ aws s3 ls s3://qtest11/
    2018-03-16 19:47:52        382 realestate.txt
  2. hadoop-aws appears twice, near the top:

    $ spark-shell 
    Ivy Default Cache set to: /home/ec2-user/.ivy2/cache
    The jars for the packages stored in: /home/ec2-user/.ivy2/jars
    :: loading settings :: url = jar:file:/home/ec2-user/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    org.apache.hadoop#hadoop-aws added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;2.7.3 in central
  3. That conf file exists; it has one line:

    
    spark.jars.packages    org.apache.hadoop:hadoop-aws:2.7.3
nchammas commented 6 years ago

Looks like us-east-2 only supports V4 signatures, in which case this answer should help.

nchammas commented 6 years ago

And if it does help, then I'm not sure there is anything I can do here. I suppose I could add a warning to the README, but there are so many little gotchas like this that I'm not sure it makes sense to try to address them all there.

qwertystop commented 6 years ago

Yeah, it doesn't really look like it's on you to fix. Thanks for finding it anyway. That said, the answer you linked isn't working for me either, but again, not on you.

nchammas commented 6 years ago

But let's get to the bottom of the issue here regardless. Can you list all the settings you've added? It looks like you need to set fs.s3a.endpoint and com.amazonaws.services.s3.enableV4 at the very least. How are you setting the latter?

qwertystop commented 6 years ago

scala> System.setProperty("com.amazonaws.services.s3.enableV4", "true")
scala> System.setProperty("fs.s3a.endpoint", "s3-us-east-2.amazonaws.com")

as suggested by that SO post.

nchammas commented 6 years ago

I think fs.s3a.endpoint needs to be set via hadoopConfiguration. Does that change anything?
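
Roughly like this, before touching any s3a:// paths (a sketch reusing the bucket, endpoint, and property names from this thread):

scala> // V4 signing via a JVM system property, the endpoint via the Hadoop configuration
scala> System.setProperty("com.amazonaws.services.s3.enableV4", "true")
scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-us-east-2.amazonaws.com")
scala> sc.textFile("s3a://qtest11/realestate.txt").count()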

qwertystop commented 6 years ago

Yep, that makes it work. Thanks for helping a novice through things.

nchammas commented 6 years ago

OK great! Glad we found a solution.