rossf7 / elasticrawl

Launch AWS Elastic MapReduce jobs that process Common Crawl data.
MIT License

File counts #1

Closed by rossf7 9 years ago

rossf7 commented 9 years ago

Get the segment names and file counts by downloading the warc.paths.gz file rather than calling the S3 API. Show the file counts for each segment in the CLI output and EMR job steps. Update the README and move the walkthrough to a new blog post, as it's too long.
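
For illustration, here is a minimal sketch of how per-segment file counts could be derived from a gzipped paths listing. It is not the gem's actual implementation: the helper name `segment_file_counts` is hypothetical, the download step is left out, and the code assumes each line of warc.paths.gz is a file path containing a `/segments/<segment name>/` component. The sample paths are made up for the demo.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper (not part of elasticrawl): takes the raw bytes of a
# gzipped paths file and returns a hash of segment name => file count.
# Assumes every path contains "/segments/<segment name>/".
def segment_file_counts(gzipped_paths)
  counts = Hash.new(0)
  Zlib::GzipReader.new(StringIO.new(gzipped_paths)).each_line do |line|
    match = line.match(%r{/segments/([^/]+)/})
    counts[match[1]] += 1 if match # skip lines without a segment component
  end
  counts
end

# Made-up sample data standing in for a downloaded warc.paths.gz file.
sample = Zlib.gzip(<<~PATHS)
  common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/warc/file-00000.warc.gz
  common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372202.67/warc/file-00001.warc.gz
  common-crawl/crawl-data/CC-MAIN-2014-49/segments/1416400372490.23/warc/file-00000.warc.gz
PATHS

segment_file_counts(sample).each do |segment, count|
  puts "#{segment}: #{count} files"
end
```

Counting from the paths file needs a single download and no S3 list requests, which is the motivation given above.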

Upgrade Ruby versions for Vagrant and Travis CI. Upgrade Gem dependencies. Replace mocha with rspec-mocks. Bump Gem version to 1.1.0.