The documentation recommends a specific instance type, an m1.xlarge EC2 instance, "for a fast connection to S3": https://github.com/trivio/common_crawl_index#using-the-remote_copy-utility-script
Since then, that instance type has become a "previous generation" Amazon instance (the M series is on m3 now).
Amazon also offers a number of other instance types that might not have been available when the documentation and script were initially written. Some are optimized for different things, as opposed to the M series, which provides a balance: http://aws.amazon.com/ec2/instance-types/
It's also suggested that the CC data be hosted on S3, in which case network performance might be less critical for the machine?
Do you have any recommendations about which instance type might be optimal with and without CC hosted on S3?