Find a way to load the AMPCamp dataset into cluster HDFS

The AMPCamp big data mini course uses a ~20 GB dataset stored on S3. This dataset has to be downloaded to the master's local disk and then copied into the cluster's HDFS.

Downloading from S3 to the local disk of a virtual instance on SL takes ~20 min. Downloading the same dataset from SL Object Storage takes ~6 min, but there is a problem: unlike S3, SL Object Storage does not allow an object to be made public. It is possible, though, to enable CDN on objects stored in SL, after which they can be downloaded via an HTTP URL. In that case, however, the dataset has to be archived and stored as a single bulk file instead of the multiple separate files it originally consisted of.
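
A minimal sketch of the CDN-based approach described above: download the archived dataset over HTTP, unpack it back into separate files on the master's local disk, then copy the directory into HDFS. It assumes the archive is a tar file, that the `hdfs` command is on the master's `PATH`, and that Python 3 is available there. The function names, file names and target paths are hypothetical and not taken from the repository.

```python
import shutil
import subprocess
import tarfile
import urllib.request
from pathlib import Path


def download_archive(url, dest):
    """Stream the archive at `url` to `dest` without loading it into memory."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def extract_archive(archive, target_dir):
    """Unpack the tar archive to restore the original multi-file layout."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Mode "r" auto-detects compression (plain tar, gzip, bzip2, xz).
    with tarfile.open(archive, "r") as tar:
        tar.extractall(target_dir)
    return target_dir


def hdfs_put_command(local_dir, hdfs_dir):
    """Build the `hdfs dfs -put` command that copies a local directory into HDFS."""
    return ["hdfs", "dfs", "-put", str(local_dir), hdfs_dir]


def load_dataset(url, workdir, hdfs_dir):
    """Download, unpack, and upload the dataset; raises if `hdfs` fails."""
    workdir = Path(workdir)
    # "dataset.tar.gz" and "data" are hypothetical names chosen for this sketch.
    archive = download_archive(url, workdir / "dataset.tar.gz")
    data_dir = extract_archive(archive, workdir / "data")
    subprocess.run(hdfs_put_command(data_dir, hdfs_dir), check=True)
```

On the master this could be called as `load_dataset(cdn_url, "/mnt/ampcamp", "/ampcamp-data")`, where `cdn_url` is the HTTP URL of the CDN-enabled object and both paths are placeholders. The working directory needs enough free space for the archive and the extracted files at the same time.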

irifed / ansible-bdas
