rossf7 / elasticrawl

Launch AWS Elastic MapReduce jobs that process Common Crawl data.


This blog post has a walkthrough of running the example jobs on the November 2014 crawl.


gem install elasticrawl --no-rdoc --no-ri


If you get the error "EMR service role arn:aws:iam::156793023547:role/EMR_DefaultRole is invalid" when launching a cluster, you don't have the necessary IAM roles. To fix this, install the AWS CLI and run the command below.

aws emr create-default-roles 


elasticrawl init

The init command takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created and will store your data and logs.

~$ elasticrawl init your-s3-bucket

Enter AWS Access Key ID: ************
Enter AWS Secret Access Key: ************


Bucket s3://elasticrawl-test created
Config dir /Users/ross/.elasticrawl created
Config complete

elasticrawl parse

The parse command takes in the crawl name and an optional number of segments and files to parse.

~$ elasticrawl parse CC-MAIN-2015-48 --max-segments 2 --max-files 3
Segment: 1416400372202.67 Files: 150
Segment: 1416400372490.23 Files: 124

Job configuration
Crawl: CC-MAIN-2015-48 Segments: 2 Parsing: 3 files per segment

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)

Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB

elasticrawl combine

The combine command takes in the results of previous parse jobs and produces a combined set of results.

~$ elasticrawl combine --input-jobs 1420124830792
Job configuration
Combining: 2 segments

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)

Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL
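
The job ID printed when a parse job launches is the value that combine takes in --input-jobs. If you script the two steps, the ID can be pulled out of the captured output. The following is a minimal sketch, assuming the confirmation line has the same format as the sample output above. The helper names extract_job_id and combine_command are hypothetical and are not part of the elasticrawl gem.

```ruby
# Hypothetical helpers; not part of the elasticrawl gem.
# Assumes the launch confirmation line looks like the sample output:
#   "Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB"
JOB_LINE = /^Job:\s+(\d+)\s+Job Flow ID:\s+(\S+)/

# Returns the job ID as a string, or nil if no confirmation line is found.
def extract_job_id(output)
  match = output.match(JOB_LINE)
  match && match[1]
end

# Builds the combine command line for a single parse job ID.
def combine_command(job_id)
  "elasticrawl combine --input-jobs #{job_id}"
end

sample = "Launch job? (y/n)\n\nJob: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB\n"

job_id = extract_job_id(sample)
puts combine_command(job_id) if job_id
```

The sketch only builds the command string, so you can review it before running anything that starts a cluster.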

elasticrawl status

The status command shows crawls and your job history.

~$ elasticrawl status
Crawl Status
CC-MAIN-2015-48 Segments: to parse 98, parsed 2, total 100

Job History (last 10)
1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2015-48 Segments: 2 Parsing: 3 files per segment
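
If you want to track progress from a script, the crawl status line can be turned into numbers. This is a rough sketch that assumes the line format shown in the sample output above. parse_crawl_status is a hypothetical helper and is not provided by elasticrawl.

```ruby
# Hypothetical helper; not part of the elasticrawl gem.
# Assumes crawl status lines look like the sample output:
#   "CC-MAIN-2015-48 Segments: to parse 98, parsed 2, total 100"
STATUS_LINE = /^\s*(\S+) Segments: to parse (\d+), parsed (\d+), total (\d+)/

# Returns a hash of the crawl name and segment counts, or nil if the
# line is not a crawl status line (for example a job history line).
def parse_crawl_status(line)
  match = line.match(STATUS_LINE)
  return nil unless match

  {
    crawl: match[1],
    to_parse: match[2].to_i,
    parsed: match[3].to_i,
    total: match[4].to_i
  }
end

status = parse_crawl_status('CC-MAIN-2015-48 Segments: to parse 98, parsed 2, total 100')
percent = 100.0 * status[:parsed] / status[:total]
puts format('%s: %.1f%% parsed', status[:crawl], percent)
```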

elasticrawl reset

The reset command resets a crawl so it is parsed again.

~$ elasticrawl reset CC-MAIN-2015-48
Reset crawl? (y/n)
 CC-MAIN-2015-48 Segments: to parse 100, parsed 0, total 100

elasticrawl destroy

The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.

~$ elasticrawl destroy

Bucket s3://elasticrawl-test and its data will be deleted
Config dir /home/vagrant/.elasticrawl will be deleted
Delete? (y/n)

Bucket s3://elasticrawl-test deleted
Config dir /home/vagrant/.elasticrawl deleted
Config deleted

Configuring Elasticrawl

The elasticrawl init command creates the ~/.elasticrawl/ directory which contains


Elasticrawl is developed in Ruby and requires Ruby 2.1.0 or later (Ruby 2.3 is recommended). The sqlite3 and nokogiri gems have C extensions, which means you may need to install development headers.
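
To check whether the running interpreter meets the stated minimum before installing, something like the following could be used. It is a small sketch, not code from the gem. The name ruby_supported? is made up for illustration, and the 2.1.0 minimum is taken from the requirement above.

```ruby
require 'rubygems'

# Minimum version taken from the requirement stated in this README.
MINIMUM_RUBY = Gem::Version.new('2.1.0')

# Hypothetical helper; not part of the elasticrawl gem.
# Gem::Version compares version segments numerically, so '2.10.0'
# sorts after '2.3.0', unlike a plain string comparison.
def ruby_supported?(version = RUBY_VERSION)
  Gem::Version.new(version) >= MINIMUM_RUBY
end

puts "Ruby #{RUBY_VERSION} supported: #{ruby_supported?}"
```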

Build Status: 2.0.0, 2.1.8, 2.2.4, 2.3.0




  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request


This code is licensed under the MIT license.