nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Issues reading S3 data on Spark cluster #336

Closed aagarwal1996 closed 3 years ago

aagarwal1996 commented 3 years ago

Hi, I launched a Flintrock Spark cluster that is unable to read data from an S3 bucket. I am launching a spark-shell as follows:

```
spark-shell --jars hadoop-aws-2.7.2.jar,aws-java-sdk-1.7.4.jar \
  --packages "joda-time:joda-time:2.10.6" \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-6.amazonaws.com \
  --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
  --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
  --executor-memory 5G --driver-memory 5G \
  --conf spark.executor.memoryOverhead=500 \
  --master spark://ec2-3-95-133-15.compute-1.amazonaws.com:7077
```

However, when I attempt to read the data, I get the following error: `Premature end of Content-Length delimited message body (expected: 12,856,986; received: 8,581)`.

I am not sure why this is happening, since I used identical code and the same config.yaml file to launch and read data from a Spark cluster a few months ago. This issue seems to be new.

Any help would be greatly appreciated!

nchammas commented 3 years ago

Can you post your Flintrock config? The Spark and Hadoop/HDFS versions, in particular, are relevant. They have to align with the version of hadoop-aws that you're using.
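For reference, a rough sketch of what aligned versions look like in practice (the version numbers here are illustrative; substitute whatever your cluster actually reports):

```
# The hadoop-aws artifact should match the Hadoop version on the cluster,
# so that all the Hadoop-side jars agree.
hadoop version   # e.g. prints "Hadoop 2.9.2"
spark-shell --packages org.apache.hadoop:hadoop-aws:2.9.2
```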

aagarwal1996 commented 3 years ago

Thank you for the quick reply! Would my config.yaml file be sufficient? If so, this is my config.yaml:

```yaml
services:
  spark:
    git-commit: c2a356f1faef79076a7e3d7b9af874469f9683bb
    git-repository: https://github.com/shifwang/spark
  hdfs:
    version: 2.9.2

provider: ec2

providers:
  ec2:
    key-name: virginia
    identity-file: /home/ubuntu/virginia.pem
    instance-type: t3.large
    region: us-east-1
    availability-zone:
    ami: ami-00b882ac5193044e4  # Amazon Linux 2, us-east-1
    user: ec2-user
    tenancy: default  # default | dedicated
    ebs-optimized: no  # yes | no
    instance-initiated-shutdown-behavior: terminate  # terminate | stop
    user-data: /path/to/userdata/script

launch:
  num-slaves: 1
  install-hdfs: True
  install-spark: False

debug: false
```

nchammas commented 3 years ago

Instead of `--jars hadoop-aws-2.7.2.jar,aws-java-sdk-1.7.4.jar`, try `--packages org.apache.hadoop:hadoop-aws:2.7.7`.

If that doesn't work, try `--packages org.apache.hadoop:hadoop-aws:2.9.2`.
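For example, your original invocation reworked to use `--packages` (a sketch; the endpoint, memory settings, and master URL are copied unchanged from your command above):

```
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.7,joda-time:joda-time:2.10.6 \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-6.amazonaws.com \
  --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
  --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
  --executor-memory 5G --driver-memory 5G \
  --conf spark.executor.memoryOverhead=500 \
  --master spark://ec2-3-95-133-15.compute-1.amazonaws.com:7077
```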

nchammas commented 3 years ago

@aagarwal1996 - Did that help?

aagarwal1996 commented 3 years ago

Really sorry for the late reply. I tried the solutions above, but I am now getting a `java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found` error.

aagarwal1996 commented 3 years ago

As a quick update, I downloaded some jars that managed to deal with the `java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found` issue.

However, when I launched a spark-shell with the `org.apache.hadoop:hadoop-aws:2.7.7` package, I continued to get the `Premature end of Content-Length delimited message body (expected: 12,856,986; received: 8,581)` error.

When I launched it with `org.apache.hadoop:hadoop-aws:2.9.2`, I got the following error instead: `java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics`.

nchammas commented 3 years ago

Please share the line of code that fails, whether with these class exceptions or otherwise.

Are you using `s3n://` or `s3a://`? You should only use S3A; S3N has been deprecated for a while.

Also, try to avoid downloading individual jars. `--packages` is more robust and takes care of selecting the appropriate jars for you.

aagarwal1996 commented 3 years ago

I see, thanks for the tip!

I am using S3A. The line of code is:

```scala
scala> val temp_data = spark.read.format("csv").option("inferSchema", "true").load("s3a://ACCESS_KEY:SECRET_KEY@aa-yu-data-test/Higgs_data/HIGGSaa.txt")
```

nchammas commented 3 years ago

> I am now getting a `java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found`

This error suggests that something somewhere is trying to use S3N. The code you shared is obviously using S3A, but I suspect you have some other configs that are confusing the situation.

Are you sure you don't have any Spark or Hadoop configs related to S3N?
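One quick way to check from inside the shell (a sketch; `spark` is the standard spark-shell session object):

```scala
// List any session configs that mention s3n; an empty result means nothing
// S3N-related made it into the Spark conf.
spark.conf.getAll.filter { case (k, v) => k.contains("s3n") || v.contains("s3n") }.foreach(println)

// Also check the Hadoop side for an explicit S3N filesystem binding
// (prints null if unset).
println(spark.sparkContext.hadoopConfiguration.get("fs.s3n.impl"))
```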

mkhan037 commented 3 years ago

I was also having this same issue with Spark 2.4.7 and Hadoop 2.7.7. I searched around and found https://github.com/delta-io/delta/issues/544. I reverted my JDK from 1.8.0_282 to 1.8.0_265 and the issue went away. I have absolutely no clue how the JDK version could lead to this issue, but for me that seems to have been the case.
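For anyone comparing notes, the exact build is easy to check on the nodes (the update number, e.g. `_265` vs `_282`, is what differed for me):

```
# Print the exact JDK build in use on a node.
java -version
```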

On a side note @nchammas: if I use a custom AMI with my preferred Java version installed, does Flintrock skip installing Java? That would be helpful for me, as I want to use the specific JDK version that worked in my case.

nchammas commented 3 years ago

@aagarwal1996 - Any new info here? I'm going to close the issue but we can continue investigating (and reopen if appropriate).

@mkhan037 - Thanks for sharing that tidbit about JDK versions. I agree, it's puzzling.

If you use a custom AMI that provides Java, Flintrock should skip doing its own install provided that it can find the existing Java installation.
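A quick way to sanity-check that the AMI's Java is discoverable before launching (a sketch; `<test-instance>` is a placeholder for an instance booted off your custom AMI):

```
# Confirm the login user can find java on its default PATH.
ssh ec2-user@<test-instance> 'java -version'
```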