trivio / common_crawl_index

Index URLs in Common Crawl

remote_copy script to copy pages looked-up with the index to your s3 bucket #11

Closed jspacker closed 11 years ago

jspacker commented 11 years ago

Hi, this pull request adds a script, remote_copy, which, given a domain or a list of domains, looks them up in the index, downloads the relevant page bytes from the crawl on S3, concatenates them back into larger files, and re-uploads them to an S3 location of your choosing. It uses Python multiprocessing and a cache of boto S3 key objects to improve performance.

The download/re-upload process can be slow due to S3 API response latency (a separate request must be made for each key), but selecting only the relevant bytes from each segment file avoids transferring a lot of unnecessary data, which is useful if you only need a small subset of the crawl. Running the script on an EC2 instance, I was able to copy 3.3 GB of compressed data in about an hour.
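For reference, here is a minimal sketch of the byte-range copy idea described above, not the actual remote_copy implementation. It assumes boto (v2) and a hypothetical list of (segment_path, offset, length) entries obtained from the index lookup; the bucket names are placeholders.

```python
# Sketch only: fetch just the relevant byte ranges from crawl segment files
# in parallel, then concatenate and re-upload them as a single S3 object.
from multiprocessing import Pool

import boto
from boto.s3.key import Key

CRAWL_BUCKET = 'aws-publicdatasets'  # bucket hosting the Common Crawl segments
DEST_BUCKET = 'my-output-bucket'     # hypothetical destination bucket


def fetch_page(entry):
    """Download only the bytes for one page from its crawl segment file."""
    segment_path, offset, length = entry
    conn = boto.connect_s3()
    bucket = conn.get_bucket(CRAWL_BUCKET, validate=False)
    key = Key(bucket, segment_path)
    # Request only the relevant byte range instead of the whole segment file.
    byte_range = 'bytes=%d-%d' % (offset, offset + length - 1)
    return key.get_contents_as_string(headers={'Range': byte_range})


def copy_domain(entries, dest_name, processes=8):
    """Fetch all pages for a domain in parallel and re-upload as one object."""
    pool = Pool(processes=processes)
    pages = pool.map(fetch_page, entries)
    conn = boto.connect_s3()
    out_key = conn.get_bucket(DEST_BUCKET).new_key(dest_name)
    out_key.set_contents_from_string(''.join(pages))
```

Because the crawl pages are stored as gzip members, simple concatenation of the fetched ranges still yields a valid compressed stream.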

A note with example usage has been added to the bottom of README.md.

srobertson commented 11 years ago

This looks really cool! I don't have time to give it a full review, so I'm going to do the unconventional thing and add you as a committer. Feel free to merge it in if you think it's up to snuff.