This addresses the issue described in #238: the Spark project has stopped publishing releases to S3, so we now have to download Spark from the Apache mirror network.
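For reference, here's a minimal sketch of how a mirror can be resolved through Apache's `closer.lua` endpoint. The `preferred` and `path_info` keys in the JSON response are assumptions based on the endpoint's behavior at the time of writing, and the version in the example path is illustrative:

```python
import json
from urllib.request import urlopen

def get_mirror_url(path):
    """Ask apache.org to suggest a mirror that hosts the given path."""
    response = urlopen(
        'https://www.apache.org/dyn/closer.lua?path={path}&as_json=1'
        .format(path=path))
    suggestion = json.loads(response.read().decode('utf-8'))
    # Assumed response shape: 'preferred' is the suggested mirror root,
    # 'path_info' is the requested path under that root.
    return suggestion['preferred'] + suggestion['path_info']

print(get_mirror_url('spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz'))
```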
While doing this work I also:
- Unified how we download Hadoop and Spark, since they both now come from the same place.
- Bumped the default versions of Hadoop to 2.8.4 and Spark to 2.3.0.
- Added a test for the download script to make sure it works on both Python 2 and Python 3, since it runs on the cluster nodes. (See the sketch below.)
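A rough sketch of the kind of Python 2/3-compatible download helper being tested; this is illustrative, not the actual script:

```python
from __future__ import print_function

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

def download_package(url, destination):
    """Download url to destination using only the standard library."""
    print("Downloading {url}...".format(url=url))
    urlretrieve(url, destination)
```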
TODO:
- [x] Add warnings about problems with the Apache mirrors (slow, unreliable, only the latest releases) and recommend that users set a custom `download-source`.
- [x] Proactively check for package availability before launching the cluster. (See the sketch after this list.)
- [x] Confirm that building Spark from source still works when Hadoop 2.8.4 is configured.
- [ ] ~~Add a wiki page explaining how to host Hadoop or Spark packages on a public S3 bucket.~~
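An illustrative version of the pre-launch availability check; the function name is hypothetical, not Flintrock's actual API:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def is_package_available(url):
    """Return True if url answers a HEAD request with a 2xx status."""
    try:
        response = urlopen(Request(url, method='HEAD'))
        return 200 <= response.getcode() < 300
    except (HTTPError, URLError):
        return False
```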
For a future PR:
- [ ] Allow `download-source` to be set to an `s3://` URL, and use the AWS CLI to download in that case. That will let users serve packages from private buckets which the cluster can access via IAM roles. (Sketched below.)
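A sketch of that future behavior: shell out to the AWS CLI so that credentials from the node's IAM role are picked up automatically. The function name here is hypothetical:

```python
import subprocess

def download_via_aws_cli(s3_url, destination):
    """Copy an s3:// URL to a local path using `aws s3 cp`."""
    subprocess.check_call(['aws', 's3', 'cp', s3_url, destination])
```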