nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Unable to access S3 - ClassNotFoundException #257

Closed cwithrowss closed 6 years ago

cwithrowss commented 6 years ago

I'll say upfront that this doesn't seem to fundamentally be a Flintrock issue, since Google shows an endless stream of configuration problems with accessing S3 from Spark. But I thought I'd post anyway, since it isn't working in any Spark/Hadoop version combination I've tried (though the errors vary).

When I launch a cluster with Spark 2.3.1 / Hadoop 2.9.1 and open spark-shell to read in a file from S3, I get this error:

```
scala> textFile.count()
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  ...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  ...
```

I have no trouble accessing the bucket directly with the AWS CLI. My understanding is that FileSystem errors like this usually mean the hadoop-aws jar is missing from the classpath or is the wrong version for the build, as mentioned here. However:
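For reference, the read that triggers this is roughly the following (the bucket and key here are placeholders, not my actual paths):

```
scala> val textFile = sc.textFile("s3a://some-bucket/some-file.txt")  // placeholder path
scala> textFile.count()
```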

The Spark defaults conf file shows:

```
spark.jars.packages  org.apache.hadoop:hadoop-aws:2.9.1
```
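(That setting lives in $SPARK_HOME/conf/spark-defaults.conf. As far as I know, the same coordinate can also be passed per session with the --packages flag, which takes precedence over the defaults file:

```
# Same dependency, resolved at shell startup instead of from spark-defaults.conf
$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.9.1
```

Either way, I see the same failure.)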

And when I load spark-shell, the correct dependency versions load, according to the Maven repo:

```
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-44f335b2-5841-4347-8307-4c8ca703e83d;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;2.9.1 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.199 in central
	found org.apache.commons#commons-lang3;3.4 in central
:: resolution report :: resolve 574ms :: artifacts dl 9ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.199 from central in [default]
	org.apache.commons#commons-lang3;3.4 from central in [default]
	org.apache.hadoop#hadoop-aws;2.9.1 from central in [default]
```

When I tried other version combinations, I got errors about invalid access, which supposedly indicates mixing different Hadoop versions in one cluster. Any ideas? Thank you for your time (if you have it!)

nchammas commented 6 years ago

See #254 and try hadoop-aws:2.7.6, even though you've deployed Hadoop 2.9.1.
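The short version, if I recall the details of #254 correctly: Spark's prebuilt binaries bundle Hadoop 2.7 client jars, and org.apache.hadoop.fs.StorageStatistics only exists from Hadoop 2.8 onward, so hadoop-aws 2.9.1 references a class that isn't on spark-shell's classpath. Something along these lines should work (bucket and key are placeholders):

```
$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.6
scala> val textFile = sc.textFile("s3a://some-bucket/some-file.txt")  // placeholder path
scala> textFile.count()
```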

cwithrowss commented 6 years ago

That worked! Thank you. I'd tried other --packages but not that particular one.

nchammas commented 6 years ago

Glad it did. You ended up being an unwitting test subject for #254. 😄