irifed / ansible-bdas

Ansible recipes for Berkeley Data Analytics Stack deployment
Apache License 2.0

Find a way to put AMPCamp dataset to cluster HDFS #3

Open irifed opened 9 years ago

irifed commented 9 years ago

The AMPCamp big data mini course uses a ~20 GB dataset that is stored on S3. This dataset has to be downloaded to the master's local disk and then put into the cluster's HDFS.
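For illustration, the second step (copying the already-downloaded dataset from the master's local disk into HDFS) could look roughly like the sketch below. It assumes the `hdfs` command-line client is on the master's `PATH` (older Hadoop releases use `hadoop fs` instead); the paths `/mnt/ampcamp-data` and `/data` and both function names are hypothetical and not part of this repository.

```python
import subprocess


def hdfs_put_commands(local_dir, hdfs_dir):
    """Return the hdfs CLI invocations that copy local_dir into hdfs_dir."""
    return [
        # Create the target directory (and any missing parents) in HDFS.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Copy the local directory; it ends up under hdfs_dir.
        ["hdfs", "dfs", "-put", local_dir, hdfs_dir],
    ]


def put_to_hdfs(local_dir, hdfs_dir, dry_run=True):
    """Run the commands, or only return them when dry_run is set."""
    commands = hdfs_put_commands(local_dir, hdfs_dir)
    if not dry_run:
        for cmd in commands:
            # check=True aborts on the first failing command.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical paths, printed only (dry run).
    for cmd in put_to_hdfs("/mnt/ampcamp-data", "/data"):
        print(" ".join(cmd))
```

With `dry_run=True` nothing is executed, which makes it easy to check the commands before running them on the master.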

Downloading from S3 to the local disk of a virtual instance on SL takes ~20 min. Downloading the same dataset from SL Object Storage takes ~6 min, but there is a problem: SL Object Storage does not allow making an object public (as S3 does). It is possible, though, to enable CDN on objects stored in SL so that they can be downloaded via an HTTP URL. In that case, however, the dataset has to be archived and stored as a single bulk file instead of multiple separate files, as it was originally.
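A minimal sketch of the archive variant, assuming the dataset has been packed into a single tar archive reachable over plain HTTP. The function name, the local file name `dataset.tar.gz` and the `extracted` subdirectory are assumptions made for illustration; the actual CDN URL is not given in this issue.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path


def fetch_and_extract(url, work_dir):
    """Download a tar archive from url into work_dir and unpack it there."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    archive = work / "dataset.tar.gz"  # hypothetical local file name

    # Stream to disk in chunks so the archive is never held in memory.
    with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
        shutil.copyfileobj(response, out)

    # tarfile detects the compression (gzip, bz2, none) automatically.
    target = work / "extracted"
    with tarfile.open(archive) as tar:
        tar.extractall(target)
    return target
```

After extraction the original multi-file layout is available again on local disk and can be put into HDFS as before. Note that this needs roughly twice the dataset size in free local disk space, because the archive and the extracted files exist side by side until the archive is deleted.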