trivio / common_crawl_index

Index URLs in Common Crawl

remote_copy script to copy pages looked-up with the index to your s3 bucket #11

Closed jspacker closed 11 years ago

jspacker commented 11 years ago

Hi, this pull request adds a script, remote_copy, which, given a domain or a list of domains, looks them up in the index, downloads the relevant page bytes from the crawl on S3, concatenates them back into larger files, and re-uploads them to an S3 location of your choosing. It uses Python multiprocessing and a cache of boto S3 key objects to improve performance.

The download/re-upload process can be slow due to S3 API response latency (a separate request must be made for each key), but selecting only the relevant bytes from each segment file avoids transferring a lot of unnecessary data, which is useful if you only need a small subset of the crawl. Running the script on an EC2 instance, I was able to copy 3.3 GB of compressed data in about an hour.
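For reference, here is a minimal sketch of the byte-range copy idea described above, not the actual remote_copy implementation. It assumes boto (v2) and a hypothetical list of (segment_path, offset, length) entries obtained from the index lookup; the bucket names are placeholders.

```python
# Sketch only: fetch just the relevant byte ranges from crawl segment files
# in parallel, then concatenate and re-upload them as a single S3 object.
from multiprocessing import Pool

import boto
from boto.s3.key import Key

CRAWL_BUCKET = 'aws-publicdatasets'  # bucket hosting the Common Crawl segments
DEST_BUCKET = 'my-output-bucket'     # hypothetical destination bucket


def fetch_page(entry):
    """Download only the bytes for one page from its crawl segment file."""
    segment_path, offset, length = entry
    conn = boto.connect_s3()
    bucket = conn.get_bucket(CRAWL_BUCKET, validate=False)
    key = Key(bucket, segment_path)
    # Request only the relevant byte range instead of the whole segment file.
    byte_range = 'bytes=%d-%d' % (offset, offset + length - 1)
    return key.get_contents_as_string(headers={'Range': byte_range})


def copy_domain(entries, dest_name, processes=8):
    """Fetch all pages for a domain in parallel and re-upload as one object."""
    pool = Pool(processes=processes)
    pages = pool.map(fetch_page, entries)
    conn = boto.connect_s3()
    out_key = conn.get_bucket(DEST_BUCKET).new_key(dest_name)
    out_key.set_contents_from_string(''.join(pages))
```

Because the crawl pages are stored as gzip members, simple concatenation of the fetched ranges still yields a valid compressed stream.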

A note with example usage has been added to the bottom of README.md.

srobertson commented 11 years ago

This looks really cool! I don't have time to give it a full review, so I'm going to do the unconventional thing and add you as a committer. Feel free to merge it in if you think it's up to snuff.