trivio / common_crawl_index

Index URLs in Common Crawl

Recommended Instance in Documentation #24

Open wpdevs opened 10 years ago

wpdevs commented 10 years ago

The documentation recommends a specific instance type, an m1.xlarge EC2 instance, "for a fast connection to S3": https://github.com/trivio/common_crawl_index#using-the-remote_copy-utility-script

Since then, that instance type has come to be considered a "previous generation" Amazon instance (the M series is on m3 now).

Amazon also offers a number of other instance types that might not have been available when the documentation and script were initially written. Some are optimized for particular workloads, whereas the M series provides a balance. http://aws.amazon.com/ec2/instance-types/

It's also suggested that the CC data be hosted on S3. In that case, might network performance be less critical for the machine?

Do you have any recommendations about which instance type might be optimal, with and without the CC data hosted on S3?