nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Add links to new flintrock guides #243

Closed · heathermiller closed this 6 years ago

heathermiller commented 6 years ago

I just wrote up a guide on using flintrock to start up a Spark cluster and submit jobs to it. Have a look! It could be useful documentation for your project :)

Wonderful tool btw!

heathermiller commented 6 years ago

> Could you capitalize Flintrock? The executable is indeed flintrock, but the tool and project are called Flintrock. It's similar to Docker vs. docker, or PySpark vs. pyspark.

Sure.

> In part 2 of your guide you dedicate a section to setting up dependencies for S3 access. Flintrock makes a good effort to do this for you by configuring spark.jars.packages automatically to load hadoop-aws. Did that not work for you? You should not need to install hadoop-aws-2.7.2.jar and aws-java-sdk-1.7.4.jar manually.

No, that doesn't work at all. It was quite painful to realize that Flintrock wasn't doing it when I thought it was. I don't know where Flintrock is putting the jars, but they're not available to the Spark application that I'm running when I use spark-submit.
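For anyone debugging the same thing, here's a rough sketch of how to check from the driver whether the jars actually made it onto the classpath (the app name is a placeholder, and `_jvm` is a private py4j handle, so treat this as a diagnostic rather than an API):

```python
# Hypothetical diagnostic, not from the guide: check whether the hadoop-aws
# classes are actually visible to the driver JVM.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("classpath-check").getOrCreate()

# What did the session actually resolve for the packages setting?
print(spark.conf.get("spark.jars.packages", "<not set>"))

# Raises a Py4JJavaError wrapping a ClassNotFoundException if hadoop-aws
# never made it onto the driver classpath.
spark.sparkContext._jvm.java.lang.Class.forName(
    "org.apache.hadoop.fs.s3a.S3AFileSystem"
)
print("hadoop-aws classes are visible to the driver")

spark.stop()
```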

heathermiller commented 6 years ago

PS, just unbolded the text and capitalized Flintrock in a new commit.

nchammas commented 6 years ago

> It was quite painful to realize that Flintrock wasn't doing it when I thought it was. I don't know where Flintrock is putting the jars, but they're not available to the Spark application that I'm running when I use spark-submit.

We should probably fork this discussion to a dedicated issue, but to elaborate briefly here:

Flintrock configures a default for spark.jars.packages that spark-submit picks up automatically. I think the reason it didn't work for you, looking at your guide, is that you call spark-submit from your workstation rather than from the cluster. Or maybe it's that you used the cluster deploy mode? I'm not sure.

If you log in to the cluster and just call pyspark or spark-shell from there, you should see Spark loading all the hadoop-aws dependencies. It's the same as if you had called pyspark, spark-shell, or spark-submit with --packages "org.apache.hadoop:hadoop-aws:2.7.5". You can see more background on how we built this feature over on #180.
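To make that equivalence concrete, here is a minimal sketch (the S3 path is a placeholder, and it assumes AWS credentials are available on the cluster):

```python
# Run on the cluster master via pyspark or spark-submit. If Flintrock's
# spark.jars.packages default is in effect, this behaves as though
# --packages "org.apache.hadoop:hadoop-aws:2.7.5" had been passed, and the
# s3a read needs no manually copied jars.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-smoke-test").getOrCreate()

# Placeholder path; any S3 object readable with your credentials works.
df = spark.read.text("s3a://my-bucket/some-object.txt")
print("line count:", df.count())

spark.stop()
```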

I'd like to fix this so that Flintrock does the right thing for S3 dependencies regardless of how the user submits applications. I'll file an issue for this. Do you mind if I ping you from there for feedback?

heathermiller commented 6 years ago

That would be great!

And sure thing! Happy to test it out!

heathermiller commented 6 years ago

Oh, and since I was setting this guide up for students... it should be quite reproducible!

So if you just run the little example program from part 2 of the guide with spark-submit, and you drop the part about manually copying the S3 jars to the cluster with Flintrock, you should see the typical NoClassDefFoundError (or whatever exception gets thrown) when trying to use one of the S3 methods.

So that's at least one nice thing. Hopefully a reproducible example for this issue 😛
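For reference, a minimal program along those lines (a sketch, not the actual example from part 2; the bucket and key are placeholders): submitted with a bare spark-submit and no manually copied jars, the s3a read should fail with the missing-class error, and adding --packages org.apache.hadoop:hadoop-aws:2.7.5 should make it pass.

```python
# repro.py: placeholder repro in the spirit of the guide's example program.
# Submitted with a bare `spark-submit repro.py` (no --packages, no manually
# copied jars), the s3a read should fail with a ClassNotFoundException /
# NoClassDefFoundError for the s3a filesystem classes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-repro").getOrCreate()

# Placeholder object; substitute anything readable with your credentials.
lines = spark.read.text("s3a://my-bucket/data.txt")
print("line count:", lines.count())

spark.stop()
```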