rossf7 / elasticrawl

Launch AWS Elastic MapReduce jobs that process Common Crawl data.
MIT License

Elasticrawl

This blog post has a walkthrough of running the example jobs on the November 2014 crawl.

Installation

gem install elasticrawl --no-rdoc --no-ri

Troubleshooting

If you get the error "EMR service role arn:aws:iam::156793023547:role/EMR_DefaultRole is invalid" when launching a cluster, you don't have the necessary IAM roles. To fix this, install the AWS CLI and run the command below.

aws emr create-default-roles 
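
If you script your setup, the fix can be wrapped so the roles are only created when they are missing. This is a minimal sketch, not part of Elasticrawl: the function name `ensure_emr_default_roles` is hypothetical, it assumes the AWS CLI is installed and configured, and it only checks for `EMR_DefaultRole`, the role named in the error message.

```shell
# Hypothetical helper, not part of Elasticrawl.
# Creates the default EMR roles only when EMR_DefaultRole cannot be found.
# Assumes the AWS CLI is on the PATH and has valid credentials; a failed
# lookup caused by missing permissions is treated the same as a missing role.
ensure_emr_default_roles() {
  if aws iam get-role --role-name EMR_DefaultRole >/dev/null 2>&1; then
    echo "EMR_DefaultRole already exists"
  else
    echo "EMR_DefaultRole not found, creating default roles"
    aws emr create-default-roles
  fi
}
```

Call `ensure_emr_default_roles` once before launching your first cluster.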

Commands

elasticrawl init

The init command takes in an S3 bucket name and your AWS credentials. The S3 bucket will be created and will store your data and logs.

~$ elasticrawl init your-s3-bucket

Enter AWS Access Key ID: ************
Enter AWS Secret Access Key: ************

...

Bucket s3://elasticrawl-test created
Config dir /Users/ross/.elasticrawl created
Config complete

elasticrawl parse

The parse command takes in the crawl name and an optional number of segments and files to parse.

~$ elasticrawl parse CC-MAIN-2015-48 --max-segments 2 --max-files 3
Segments
Segment: 1416400372202.67 Files: 150
Segment: 1416400372490.23 Files: 124

Job configuration
Crawl: CC-MAIN-2015-48 Segments: 2 Parsing: 3 files per segment

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)
y

Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB
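
The job ID printed on the last line is what the combine command later takes as `--input-jobs`. If you save the parse output, a small filter can pull the ID out. This is a sketch that assumes the output keeps the `Job: <id> Job Flow ID: <id>` layout shown above; the function name `extract_job_id` is hypothetical.

```shell
# Hypothetical helper: read elasticrawl output on stdin and print the job ID
# from the first line of the form "Job: <digits> Job Flow ID: <id>".
extract_job_id() {
  awk '$1 == "Job:" && $2 ~ /^[0-9]+$/ { print $2; exit }'
}

# Example using the line from the sample output above.
printf 'Job: 1420124830792 Job Flow ID: j-2R3MFE6TWLIUB\n' | extract_job_id
# prints 1420124830792
```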

elasticrawl combine

The combine command takes in the results of previous parse jobs and produces a combined set of results.

~$ elasticrawl combine --input-jobs 1420124830792
Job configuration
Combining: 2 segments

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)
y

Job: 1420129496115 Job Flow ID: j-251GXDIZGK8HL

elasticrawl status

The status command shows crawls and your job history.

~$ elasticrawl status
Crawl Status
CC-MAIN-2015-48 Segments: to parse 98, parsed 2, total 100

Job History (last 10)
1420124830792 Launched: 2015-01-01 15:07:10 Crawl: CC-MAIN-2015-48 Segments: 2 Parsing: 3 files per segment
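
In this sample output the job ID looks like a Unix timestamp in milliseconds: 1420124830792 matches the launch time shown when read as UTC. That is an observation from the examples here, not documented behaviour, so treat the sketch below as an assumption. It also assumes a `date` that accepts `-d @SECONDS` (GNU coreutils); the function name `job_launch_time` is hypothetical.

```shell
# Assumption: job IDs are Unix epoch milliseconds, as the sample output suggests.
# Requires a date command that accepts "-d @SECONDS" (GNU coreutils).
job_launch_time() {
  date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S'
}

job_launch_time 1420124830792   # prints 2015-01-01 15:07:10 (UTC)
```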

elasticrawl reset

The reset command resets a crawl so it is parsed again.

~$ elasticrawl reset CC-MAIN-2015-48
Reset crawl? (y/n)
y
CC-MAIN-2015-48 Segments: to parse 100, parsed 0, total 100

elasticrawl destroy

The destroy command deletes your S3 bucket and the ~/.elasticrawl directory.

~$ elasticrawl destroy

WARNING:
Bucket s3://elasticrawl-test and its data will be deleted
Config dir /home/vagrant/.elasticrawl will be deleted
Delete? (y/n)
y

Bucket s3://elasticrawl-test deleted
Config dir /home/vagrant/.elasticrawl deleted
Config deleted

Configuring Elasticrawl

The elasticrawl init command creates the ~/.elasticrawl/ directory, which contains your configuration.

Development

Elasticrawl is developed in Ruby and requires Ruby 2.1.0 or later (Ruby 2.3 is recommended). The sqlite3 and nokogiri gems have C extensions, which means you may need to install development headers.
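
To check the Ruby requirement before installing, a version comparison along these lines can help. It is a sketch only: the function name `version_at_least` is hypothetical, and it compares up to three dot-separated numeric components.

```shell
# Hypothetical helper: succeed when ACTUAL is at least MINIMUM.
# usage: version_at_least MINIMUM ACTUAL
# Compares major, minor and patch numerically; missing parts count as 0.
version_at_least() {
  awk -v min="$1" -v ver="$2" 'BEGIN {
    split(min, a, "."); split(ver, b, ".")
    for (i = 1; i <= 3; i++) {
      if (b[i] + 0 > a[i] + 0) exit 0
      if (b[i] + 0 < a[i] + 0) exit 1
    }
    exit 0
  }'
}
```

With Ruby on your PATH, `version_at_least 2.1.0 "$(ruby -e 'print RUBY_VERSION')"` succeeds when the installed Ruby meets the minimum.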

Build status is tracked for Ruby 2.0.0, 2.1.8, 2.2.4 and 2.3.0.

TODO

Thanks

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

License

This code is licensed under the MIT license.