taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

How to setup in a cluster environment? #8

Closed dbuarque closed 10 years ago

dbuarque commented 10 years ago

This is a really awesome project, but I couldn't figure out how to set it up in a cluster environment. Could you provide a simple cluster setup (separate machines)?

taganaka commented 10 years ago

Hi! A minimal cluster setup is actually really easy.

1) A Redis endpoint that all of the instances running polipus can access
2) A page storage endpoint (MongoDB or S3) that all of the instances running polipus can access

You can even have Redis and Mongo running on the same instance; this really depends on the size of your crawling session.

Once storage and Redis are configured, it is just a matter of running multiple polipus processes on multiple instances, deploying the same software on each. The underlying use of Redis to dispatch URLs to crawl does the rest.

Take a look at this example: https://github.com/taganaka/polipus/blob/master/examples/basic.rb

Just set the right connection endpoints for Redis and Mongo and you will be fine.
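As a rough sketch of what "deploying the same software" on every machine could look like: the script below reads the shared Redis and Mongo endpoints from environment variables, so each instance runs identical code and differs only in its environment. The helper `cluster_settings` and the variable names (`REDIS_HOST`, `REDIS_PORT`, `MONGO_HOST`, `MONGO_PORT`) are hypothetical and not part of polipus. The commented-out crawler call is modeled on the project's examples and assumes the legacy Mongo Ruby driver, so check basic.rb for the exact calls in your version.

```ruby
# Hypothetical helper (not part of polipus): builds the shared connection
# settings from environment variables so every machine in the cluster can
# run this same script unchanged. The variable names are assumptions.
def cluster_settings(env = ENV)
  {
    redis_options: {
      host: env.fetch('REDIS_HOST', 'localhost'),
      port: Integer(env.fetch('REDIS_PORT', '6379'))
    },
    mongo_host: env.fetch('MONGO_HOST', 'localhost'),
    mongo_port: Integer(env.fetch('MONGO_PORT', '27017'))
  }
end

settings = cluster_settings

# With the polipus and mongo gems loaded, the settings would be passed along
# these lines (assumed from the project's examples; verify against basic.rb):
#
#   mongo = Mongo::Connection.new(settings[:mongo_host],
#                                 settings[:mongo_port]).db('crawler')
#   Polipus.crawler('my-crawl', ['http://example.com/'],
#                   redis_options: settings[:redis_options],
#                   storage: Polipus::Storage.mongo_store(mongo, 'pages')) do |crawler|
#     crawler.on_page_downloaded { |page| puts page.url }
#   end
```

Each machine would then start the same script with the same values, for example `REDIS_HOST=10.0.0.5 MONGO_HOST=10.0.0.6 ruby crawl.rb` (addresses and file name are placeholders).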

dbuarque commented 10 years ago

Awesome! Thanks a lot! Polipus is really awesome!