Closed heathermiller closed 6 years ago
Could you capitalize Flintrock? The executable is indeed flintrock, but the tool and project are called Flintrock. It's similar to Docker vs. docker, or PySpark vs. pyspark.
Sure.
In part 2 of your guide you dedicate a section to setting up dependencies for S3 access. Flintrock makes a good effort to do this for you by configuring spark.jars.packages automatically to load hadoop-aws. Did that not work for you? You should not need to install hadoop-aws-2.7.2.jar and aws-java-sdk-1.7.4.jar manually.
No, that doesn't work at all. It was quite painful to realize that Flintrock wasn't doing it when I thought it was. I don't know where Flintrock is putting the jars, but they're not available to the Spark application that I'm running when I use spark-submit.
PS, just unbolded the text and capitalized Flintrock in a new commit.
> It was quite painful to realize that Flintrock wasn't doing it when I thought it was. I don't know where Flintrock is putting the jars, but they're not available to the Spark application that I'm running when I use spark-submit.
We should probably fork this discussion to a dedicated issue, but to elaborate briefly here:
Flintrock configures a default for spark.jars.packages that spark-submit picks up automatically. I think the reason it didn't work for you, looking at your guide, is that you call spark-submit from your workstation rather than from the cluster. Or maybe it's that you used the cluster deploy mode? I'm not sure.
If you log in to the cluster and just call pyspark or spark-shell from there, you should see Spark loading all the hadoop-aws dependencies. It's the same as if you had called pyspark, spark-shell, or spark-submit with --packages "org.apache.hadoop:hadoop-aws:2.7.5". You can see more background on how we built this feature over on #180.
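For anyone following along, the two equivalent setups described above can be sketched like this. This is an illustrative sketch, not Flintrock's exact output: the property file path assumes a standard Spark layout, the hadoop-aws version should match your cluster's Hadoop version, and my_app.py is a placeholder script name.

```shell
# Option 1: a default in $SPARK_HOME/conf/spark-defaults.conf on the cluster
# (roughly what Flintrock configures for you):
#
#   spark.jars.packages    org.apache.hadoop:hadoop-aws:2.7.5

# Option 2: pass the package explicitly when launching a shell or job,
# which has the same effect as the default above:
pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.5"
spark-submit --packages "org.apache.hadoop:hadoop-aws:2.7.5" my_app.py
```

Note that either way, the resolution happens on the machine where the driver runs, which is why launching from the cluster behaves differently from launching from a workstation.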
I'd like to fix this so that Flintrock does the right thing for S3 dependencies regardless of how the user submits applications. I'll file an issue for this. Do you mind if I ping you from there for feedback?
That would be great!
And sure thing! Happy to test it out!
Oh, and since I was setting this guide up for students... It should be quite reproducible!
So if you just run the little example program from part 2 of the guide with spark-submit, and drop the part about manually copying the S3 jars to the cluster with Flintrock, you should see the typical NoClassDefFoundError (or whatever exception is thrown) when trying to use one of the S3 methods.
So that's at least one nice thing. Hopefully a reproducible example for this issue 😛
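A minimal reproduction along those lines might look like the following. The master host, bucket, and script name are all placeholders, and the exact exception class can vary by Spark/Hadoop version.

```shell
# Hypothetical repro: submit a job that reads from S3 without hadoop-aws
# on the classpath (i.e., without --packages and without the manual jars).
spark-submit --master spark://<master-host>:7077 count_s3_lines.py

# Inside count_s3_lines.py, an access such as
#   spark.read.text("s3a://my-bucket/some-file.txt")
# typically fails with a ClassNotFoundException for
#   org.apache.hadoop.fs.s3a.S3AFileSystem
# because the s3a filesystem implementation isn't on the classpath.
```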
I just wrote up a guide for using flintrock to start up a Spark cluster and how to use it to submit jobs to your cluster. Have a look! Looks like it could be useful documentation for your project :)
Wonderful tool btw!