I am not sure it needs to be that complicated. You can create or modify a spider to increment a counter in Redis based on the `crawlid` and domain of the crawl, and then refuse to yield new requests once that counter reaches your maximum. The same logic could be applied in a downloader middleware as well. The only caution is to set an `expire` on the Redis key, so that once you are done crawling that domain the key does not persist permanently in your Redis instance (and hog dead space).
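For illustration, a rough sketch of that counter as a downloader middleware might look like the following (the setting names, key format, and TTL are made up for this example and are not part of scrapy-cluster):

```python
import redis
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.httpobj import urlparse_cached


class DomainPageLimitMiddleware(object):
    """Drop requests for a domain once its per-crawl page budget is spent.

    The counter lives in Redis under a key derived from the crawlid and the
    request's domain, and carries an expire so finished crawls do not leave
    stale keys behind.
    """

    def __init__(self, redis_conn, max_pages, key_ttl):
        self.redis_conn = redis_conn
        self.max_pages = max_pages
        self.key_ttl = key_ttl

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        conn = redis.Redis(host=s.get('REDIS_HOST', 'localhost'),
                           port=s.getint('REDIS_PORT', 6379))
        # hypothetical settings, named here only for the sketch
        return cls(conn,
                   max_pages=s.getint('MAX_PAGES_PER_DOMAIN', 100),
                   key_ttl=s.getint('PAGE_LIMIT_KEY_TTL', 3600))

    def process_request(self, request, spider):
        crawlid = request.meta.get('crawlid', 'default')
        domain = urlparse_cached(request).netloc
        key = 'pagecount:{c}:{d}'.format(c=crawlid, d=domain)

        # INCR is atomic, so multiple spiders share one consistent count
        count = self.redis_conn.incr(key)
        # refresh the expire so the key goes away once the crawl is idle
        self.redis_conn.expire(key, self.key_ttl)

        if count > self.max_pages:
            raise IgnoreRequest('page limit reached for %s' % domain)
        return None
```

Enable it via `DOWNLOADER_MIDDLEWARES`, or move the same INCR/EXPIRE pattern into the spider's callbacks if you prefer to stop yielding requests there.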
Another option to consider is to use that same Redis key but add the filtering logic to the `dupefilter` (in your case it would turn into a generic filter). Again, you would still need to make sure your keys don't become stagnant, and a redis-monitor plugin could perhaps handle that.
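A minimal sketch of that direction, built on Scrapy's stock `RFPDupeFilter` rather than scrapy-cluster's own filter (key layout and limits are again illustrative):

```python
import redis
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.httpobj import urlparse_cached


class DomainLimitFilter(RFPDupeFilter):
    """Generic filter: besides fingerprint dedup, treat any request for an
    over-budget domain as "seen" so it never enters the queue."""

    max_pages = 100   # pages allowed per domain per crawl (illustrative)
    key_ttl = 3600    # seconds before an idle counter expires

    def __init__(self, path=None, debug=False):
        super(DomainLimitFilter, self).__init__(path, debug)
        self.redis_conn = redis.Redis(host='localhost', port=6379)

    def request_seen(self, request):
        # normal duplicate detection first
        if super(DomainLimitFilter, self).request_seen(request):
            return True

        crawlid = request.meta.get('crawlid', 'default')
        domain = urlparse_cached(request).netloc
        key = 'pagecount:{c}:{d}'.format(c=crawlid, d=domain)

        count = self.redis_conn.incr(key)
        self.redis_conn.expire(key, self.key_ttl)

        # returning True keeps the request out of the scheduler queue
        return count > self.max_pages
```

Wire it in with the `DUPEFILTER_CLASS` setting; the key TTL (or a redis-monitor plugin) takes care of cleaning up stale counters.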
If this is a feature request that should get more attention, I would be happy to consider a PR, or to include it as part of 1.3.
@madisonb, thanks for the help!
I like the `dupefilter` approach better as it might have less chance of hitting a race condition when running multiple spiders.
Thanks for the suggestion about the `expire` logic, as I would probably have missed that 😄
If I manage to work it out on my own, I'll send a PR.
Created pull request #107 for this feature.
Closing thanks to #165
Hello,
I have the following question: how would you add the ability to control how many pages can be crawled from each domain? For example, configuring a crawler to crawl only 100 pages from each domain and to stop queueing pages after reaching that limit.

Is there an existing way to do that, or maybe you guys have a suggestion about how to implement this kind of behavior in scrapy-cluster? I was thinking that perhaps I could add additional properties to the Redis queue in `distributed_scheduler.py` and maybe try to control it through there. What do you think?

As an example, in Frontera you can implement this kind of behavior in the strategy worker and set `MAX_PAGES_PER_HOSTNAME` in the settings file.

Any suggestion would be much appreciated. Thanks!
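For comparison, the Frontera route mentioned above boils down to a single entry in its strategy worker settings file (the value here is just an example):

```python
# Frontera strategy worker settings file
MAX_PAGES_PER_HOSTNAME = 100
```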