istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Limit number of crawled pages per domain #103

Closed kazuar closed 6 years ago

kazuar commented 7 years ago

Hello,

I have the following question: how would you add the ability to control how many pages can be crawled from each domain? For example, how would I configure a crawler so that it only crawls 100 pages from each domain and stops queueing requests for that domain once the limit is reached?

Is there an existing way to do that, or do you have any suggestions on how to implement this kind of behavior in scrapy-cluster? I was thinking that I could perhaps add additional properties to the Redis queue in distributed_scheduler.py and try to control it from there. What do you think?

As an example, Frontera lets you implement this kind of behavior in the strategy worker by setting MAX_PAGES_PER_HOSTNAME in the settings file.
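For reference, that Frontera-style approach boils down to a single settings entry that the strategy worker reads; the setting name is the one cited above, and the value is purely illustrative:

```python
# Frontera settings module (illustrative; check the Frontera docs for your version)
MAX_PAGES_PER_HOSTNAME = 100
```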

Any suggestion would be much appreciated. Thanks!

madisonb commented 7 years ago

I am not sure it needs to be that complicated. You can create or modify a spider to increment a counter in Redis keyed on the crawlid and domain of the crawl, and then refuse to yield new requests once that counter hits your maximum. The same logic could be applied in a downloader middleware as well.

The only caution is to make sure you set an expire on the Redis key, so that once you are done crawling that domain the key does not persist permanently within your Redis instance (hogging dead space).
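To make that concrete, here is a rough sketch of the counter-plus-expire idea as a downloader middleware. It assumes a plain redis-py connection, and the key layout, the 100-page default, and the setting names (MAX_PAGES_PER_DOMAIN, PAGE_COUNT_EXPIRE) are invented for illustration rather than taken from scrapy-cluster:

```python
# Hypothetical downloader middleware sketching the per-domain counter idea.
from urllib.parse import urlparse

import redis
from scrapy.exceptions import IgnoreRequest


class DomainPageLimitMiddleware(object):

    def __init__(self, redis_host, redis_port, max_pages, expire_seconds):
        self.redis_conn = redis.Redis(host=redis_host, port=redis_port)
        self.max_pages = max_pages
        self.expire_seconds = expire_seconds

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(
            s.get('REDIS_HOST', 'localhost'),
            s.getint('REDIS_PORT', 6379),
            s.getint('MAX_PAGES_PER_DOMAIN', 100),
            s.getint('PAGE_COUNT_EXPIRE', 3600),
        )

    def process_request(self, request, spider):
        # count requests per (crawlid, domain); crawlid is assumed to ride in meta
        crawlid = request.meta.get('crawlid', 'default')
        domain = urlparse(request.url).netloc
        key = "pagecount:{}:{}".format(crawlid, domain)
        count = self.redis_conn.incr(key)
        # refresh the TTL so the key cleans itself up once the crawl goes quiet
        self.redis_conn.expire(key, self.expire_seconds)
        if count > self.max_pages:
            raise IgnoreRequest(
                "page limit of {} reached for {}".format(self.max_pages, domain))
        return None
```

Registered under DOWNLOADER_MIDDLEWARES, this drops requests past the cap before they reach the downloader, and the refreshed TTL lets the key expire on its own once the domain goes quiet.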

Another option to consider is reusing that same Redis key but adding the filtering logic to the dupefilter (in your case it would turn into a generic request filter). Again, you would still need to make sure your keys don't become stagnant, and perhaps a redis-monitor plugin could handle that cleanup.
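Here is a similarly rough sketch of that filter variant, written against Scrapy's stock BaseDupeFilter interface rather than scrapy-cluster's own dupefilter; the key names, defaults, and setting names are again illustrative only:

```python
# Hypothetical per-domain request filter, modeled on Scrapy's BaseDupeFilter.
from urllib.parse import urlparse

import redis
from scrapy.dupefilters import BaseDupeFilter


class DomainLimitFilter(BaseDupeFilter):

    def __init__(self, server, max_pages=100, expire_seconds=3600):
        self.server = server          # a redis.Redis connection
        self.max_pages = max_pages
        self.expire_seconds = expire_seconds

    @classmethod
    def from_settings(cls, settings):
        server = redis.Redis(host=settings.get('REDIS_HOST', 'localhost'),
                             port=settings.getint('REDIS_PORT', 6379))
        return cls(server,
                   settings.getint('MAX_PAGES_PER_DOMAIN', 100),
                   settings.getint('PAGE_COUNT_EXPIRE', 3600))

    def request_seen(self, request):
        # Returning True tells the scheduler to drop the request, so the
        # "seen" check doubles as a per-domain page budget.
        domain = urlparse(request.url).netloc
        key = "domaincount:{}".format(domain)
        count = self.server.incr(key)
        self.server.expire(key, self.expire_seconds)
        return count > self.max_pages
```

As noted above, the keys still need the TTL (or a redis-monitor plugin) so they do not linger after the crawl finishes.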

If this is a feature request that should get more attention, I would be happy to consider a PR, or to fold it into 1.3.

kazuar commented 7 years ago

@madisonb, thanks for the help!

I like the dupefilter approach better, as it seems less likely to hit a race condition when running multiple spiders. Thanks for the suggestion about the expire logic as well; I probably would have missed that 😄

If I manage to work it out on my own, I'll send a PR.

kazuar commented 7 years ago

Created pull request #107 for this feature.

madisonb commented 6 years ago

Closing thanks to #165