This addresses the issue described in #238: the Spark project has stopped publishing releases to S3, so we now have to download Spark from the Apache mirror network.
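For reference, here's a minimal sketch of how a mirror can be resolved through Apache's `closer.lua` endpoint. The `preferred` and `path_info` keys in the JSON response are assumptions based on the endpoint's behavior at the time of writing, and the version in the example path is illustrative:

```python
import json
from urllib.request import urlopen

def get_mirror_url(path):
    """Ask apache.org to suggest a mirror that hosts the given path."""
    response = urlopen(
        'https://www.apache.org/dyn/closer.lua?path={path}&as_json=1'
        .format(path=path))
    suggestion = json.loads(response.read().decode('utf-8'))
    # Assumed response shape: 'preferred' is the suggested mirror root,
    # 'path_info' is the requested path under that root.
    return suggestion['preferred'] + suggestion['path_info']

print(get_mirror_url('spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz'))
```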
While doing this work I also:
- Unified how we download Hadoop and Spark, since they both now come from the same place.
- Bumped the default versions of Hadoop to 2.8.4 and Spark to 2.3.0.
- Added a test for the download script to make sure it works on both Python 2 and Python 3, since it runs on the cluster nodes. (See the sketch below.)
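A rough sketch of the kind of Python 2/3-compatible download helper being tested; this is illustrative, not the actual script:

```python
from __future__ import print_function

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

def download_package(url, destination):
    """Download url to destination using only the standard library."""
    print("Downloading {url}...".format(url=url))
    urlretrieve(url, destination)
```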
TODO:
- [x] Add warnings about problems with the Apache mirrors (slow, unreliable, only the latest releases) and recommend that users set a custom `download-source`.
- [x] Proactively check for package availability before launching the cluster. (See the sketch after this list.)
- [x] Confirm that building Spark from source still works when Hadoop 2.8.4 is configured.
- [ ] ~~Add a wiki page explaining how to host Hadoop or Spark packages on a public S3 bucket.~~
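An illustrative version of the pre-launch availability check; the function name is hypothetical, not Flintrock's actual API:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def is_package_available(url):
    """Return True if url answers a HEAD request with a 2xx status."""
    try:
        response = urlopen(Request(url, method='HEAD'))
        return 200 <= response.getcode() < 300
    except (HTTPError, URLError):
        return False
```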
For a future PR:
- [ ] Allow `download-source` to be set to an `s3://` URL, and use the AWS CLI to download in that case. That will let users serve packages from private buckets which the cluster can access via IAM roles. (Sketched below.)
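A sketch of that future behavior: shell out to the AWS CLI so that credentials from the node's IAM role are picked up automatically. The function name here is hypothetical:

```python
import subprocess

def download_via_aws_cli(s3_url, destination):
    """Copy an s3:// URL to a local path using `aws s3 cp`."""
    subprocess.check_call(['aws', 's3', 'cp', s3_url, destination])
```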