nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Move hadoop-aws from default package to README recommendation #254

Closed · nchammas closed this 6 years ago

nchammas commented 6 years ago

Following up on @heathermiller's comments here, I think this fixes s3a:// access for folks using spark-submit --deploy-mode cluster from a remote client.

It's difficult to come up with a completely seamless way to set up s3a:// due to all the incompatibilities you have to work around, but I think that instructing users to pass --packages ...:hadoop-aws is the best I can do at this time at the intersection of "easy to maintain" and "works for the user".
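For concreteness, a sketch of the kind of invocation the README recommendation points at. The package coordinate matches the jar versions named below (hadoop-aws 2.7.2 pulls in aws-java-sdk 1.7.4 as a transitive dependency); the master URL, application class, jar, and bucket are placeholders, not anything from this PR:

```sh
# Sketch only: master URL, app class/jar, and S3 path are placeholders.
spark-submit \
  --master spark://<master-hostname>:7077 \
  --deploy-mode cluster \
  --packages org.apache.hadoop:hadoop-aws:2.7.2 \
  --class com.example.MyApp \
  my-app.jar \
  s3a://my-bucket/input
```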

I don't have a Scala toolchain set up on my machine, so I couldn't directly confirm that this eliminates the need to manually copy the AWS-related jars, as described here. PySpark does not support cluster deploy mode with a standalone master, so I couldn't test it that way either. But I'd expect this to work, since it works when I interactively start a PySpark shell from the master.
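The interactive case I did verify looks roughly like this (the bucket and key are placeholders, and credentials are assumed to come from instance roles or fs.s3a.* settings):

```sh
# From the master; --packages fetches hadoop-aws and its aws-java-sdk dependency.
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.2

# Then, inside the shell, a read like this should resolve s3a:// paths:
#   sc.textFile("s3a://my-bucket/some-key").count()
```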

@heathermiller - If you have the time to test that calling spark-submit --deploy-mode cluster --packages ...:hadoop-aws eliminates the need to manually install hadoop-aws-2.7.2.jar and aws-java-sdk-1.7.4.jar on the cluster, that would be great. No worries otherwise.
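For reference, the manual workaround this is meant to replace looks something like the sketch below. The host list and Spark install path are assumptions, and the jars/ directory reflects a Spark 2.x layout:

```sh
# Copy the two jars onto every node's Spark classpath by hand.
for host in $(cat cluster-hosts.txt); do    # hypothetical host inventory
  scp hadoop-aws-2.7.2.jar aws-java-sdk-1.7.4.jar \
      "$host:/usr/local/spark/jars/"
done
```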

Related to #180.