aagarwal1996 closed this issue 3 years ago.
Can you post your Flintrock config? The Spark and Hadoop/HDFS versions, in particular, are relevant. They have to align with the version of hadoop-aws
that you're using.
Thank you for the quick reply! Would my config.yaml be sufficient? If so, here it is:
```yaml
services:
  spark:
    git-commit: c2a356f1faef79076a7e3d7b9af874469f9683bb
    git-repository: https://github.com/shifwang/spark
  hdfs:
    version: 2.9.2

provider: ec2

providers:
  ec2:
    key-name: virginia
    identity-file: /home/ubuntu/virginia.pem
    instance-type: t3.large
    region: us-east-1
    ami: ami-00b882ac5193044e4  # Amazon Linux 2, us-east-1
    user: ec2-user
    tenancy: default  # default | dedicated
    ebs-optimized: no  # yes | no
    instance-initiated-shutdown-behavior: terminate  # terminate | stop

launch:
  num-slaves: 1

debug: false
```
Instead of `--jars hadoop-aws-2.7.2.jar,aws-java-sdk-1.7.4.jar`, try `--packages org.apache.hadoop:hadoop-aws:2.7.7`.

If that doesn't work, try `--packages org.apache.hadoop:hadoop-aws:2.9.2`.
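Putting that together with the original launch command, the invocation would look something like the sketch below. The `--jars` flags are dropped in favor of `--packages`, the endpoint is set to us-east-1 to match the region in the config above, and the master URL is copied from the original report; the exact hadoop-aws version should match the Hadoop line your cluster was built against.

```shell
# Sketch: let --packages resolve hadoop-aws and its matching aws-java-sdk,
# instead of hand-downloading individual jars.
spark-shell \
  --packages org.apache.hadoop:hadoop-aws:2.7.7,joda-time:joda-time:2.10.6 \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-1.amazonaws.com \
  --master spark://ec2-3-95-133-15.compute-1.amazonaws.com:7077
```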
@aagarwal1996 - Did that help?
Really sorry for the late reply. I tried the solutions above, but I am now getting a `java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found`.
As a quick update, I downloaded some jars that resolved the `java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found` issue.

However, when I launched a Spark shell with the `org.apache.hadoop:hadoop-aws:2.7.7` package, I continued to get the `Premature end of Content-Length delimited message body (expected: 12,856,986; received: 8,581)` error.

When I launched it with `org.apache.hadoop:hadoop-aws:2.9.2`, I got the following error instead: `java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics`.
Please share the line of code that fails with these exceptions.
Are you using `s3n://` or `s3a://`? You should only use S3A; S3N has been deprecated for a while.
Also, try to avoid downloading individual jars. `--packages` is more robust and takes care of selecting the appropriate jars for you.
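Relatedly, rather than embedding credentials in the `s3a://` URL, you can supply them through Hadoop configuration properties. A minimal sketch (the `fs.s3a.*` property names are standard S3A settings; the values here are placeholders):

```
# spark-defaults.conf (or pass each line as a --conf flag to spark-shell)
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY
spark.hadoop.fs.s3a.endpoint     s3.us-east-1.amazonaws.com
```

With those set, the read can drop the inline credentials, e.g. `load("s3a://aa-yu-data-test/Higgs_data/HIGGSaa.txt")`.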
I see, thanks for the tip!
I am using s3a. The line of code is:

```scala
scala> val temp_data = spark.read.format("csv").option("inferschema", "true").load("s3a://ACCESS_KEY:SECRET_KEY@aa-yu-data-test/Higgs_data/HIGGSaa.txt")
```

I am now getting a `java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found`.
This error suggests that something somewhere is trying to use S3N. The code you shared is obviously using S3A, but I suspect you have some other configs that are confusing the situation.
Are you sure you don't have any Spark or Hadoop configs related to S3N?
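One quick way to check is to search the Spark and Hadoop configuration directories for lingering S3N settings. A minimal sketch, assuming `$SPARK_HOME` and `$HADOOP_HOME` point at your installations (adjust the paths to your layout):

```shell
# Search Spark and Hadoop config directories for any s3n-related settings.
# A hit (e.g. fs.s3n.impl or fs.s3n.awsAccessKeyId) points at the culprit.
for dir in "$SPARK_HOME/conf" "$HADOOP_HOME/etc/hadoop"; do
  [ -d "$dir" ] && grep -rn "s3n" "$dir"
done
```

Any `fs.s3n.*` entry in `spark-defaults.conf` or `core-site.xml` would explain Spark trying to load `NativeS3FileSystem` even though your code uses `s3a://`.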
I was having the same issue with Spark 2.4.7 and Hadoop 2.7.7. I searched around and found https://github.com/delta-io/delta/issues/544. I reverted my JDK from 1.8.0_282 to 1.8.0_265 and the issue went away. I have no idea how the JDK version could cause this, but for me that seems to have been the case.
On a side note @nchammas, if I use a custom AMI with my preferred Java version installed, does Flintrock skip installing Java? That would be helpful for me, as I want to use the specific JDK version that worked in my case.
@aagarwal1996 - Any new info here? I'm going to close the issue but we can continue investigating (and reopen if appropriate).
@mkhan037 - Thanks for sharing that tidbit about JDK versions. I agree, it's puzzling.
If you use a custom AMI that provides Java, Flintrock should skip doing its own install provided that it can find the existing Java installation.
Hi, I launched a Flintrock Spark cluster that is unable to read data from an S3 bucket. I am launching a Spark shell as follows:

```shell
spark-shell --jars hadoop-aws-2.7.2.jar,aws-java-sdk-1.7.4.jar \
  --packages "joda-time:joda-time:2.10.6" \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-6.amazonaws.com \
  --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
  --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
  --executor-memory 5G --driver-memory 5G \
  --conf spark.executor.memoryOverhead=500 \
  --master spark://ec2-3-95-133-15.compute-1.amazonaws.com:7077
```
However, when I attempt to read the data I get the following error: `Premature end of Content-Length delimited message body (expected: 12,856,986; received: 8,581)`.

I am not sure why this is happening, since I used the same code and config.yaml to launch and read data from a Spark cluster a few months ago. This issue seems to be new.
Any help would be greatly appreciated!