Closed joaqo closed 3 years ago
Interesting use case. Each spider crawl runs in its own process and does not share its state with the others, that's true. I'm not sure there's an easy way to make them share their downloader information. One option is to route all spiders through a common proxy, which would do the throttling.
Related question: why do you need to run multiple spiders concurrently against the same domain? Scrapy can run more concurrent requests from a single spider if you raise the CONCURRENT_* settings.
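For reference, these are the settings the comment above is pointing at. The values below are illustrative, not recommendations:

```python
# settings.py — raise concurrency within a single spider instead of
# running several spiders against the same domain.
CONCURRENT_REQUESTS = 32             # global cap across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # cap per target domain
DOWNLOAD_DELAY = 0.25                # seconds between requests (per spider)
```

With one spider, DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN together bound the request rate to the site; the per-spider limitation discussed in this issue only appears once a second process is started.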
@redapple Thanks for the answer. The sites we are crawling are quite large, so it makes sense for us to divide our code into separate spiders, to scrape different parts of the site.
@joaqo I've encountered a similar problem and we got around it by using a custom downloader that pulls and stores the values from Redis instead of locally (it's not the greatest way to do this, but it's worked out for us so far).
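The Redis approach above could look roughly like the sketch below: keep the timestamp of the last request to each domain in a shared store, and have every spider's downloader consult it before firing. All names here are hypothetical (this is not Scrapy or redis-py API); a dict-backed stub stands in for the Redis client so the example is self-contained:

```python
import time

class SharedDelayThrottle:
    """Cross-process throttle sketch: the last-request timestamp per
    domain lives in a shared store (Redis in the real setup) instead of
    in each spider's memory, so all spiders honor one delay."""

    def __init__(self, store, delay):
        self.store = store  # any object with get()/set(), e.g. a redis.Redis client
        self.delay = delay  # minimum seconds between requests to one domain

    def wait_time(self, domain, now=None):
        """Seconds the caller should sleep before hitting `domain`."""
        now = time.time() if now is None else now
        last = self.store.get(domain)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (now - float(last)))

    def record(self, domain, now=None):
        """Call after sending a request so other processes see it."""
        self.store.set(domain, time.time() if now is None else now)

class DictStore:
    """In-memory stand-in exposing the same get()/set() surface we use
    from a Redis client; replace with redis.Redis(...) in production."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value):
        self.data[key] = value
```

In a real deployment every spider process would construct the throttle with the same Redis instance, and a downloader middleware would sleep for `wait_time(...)` before each request and call `record(...)` afterwards.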
@joaqo, I understand that it makes sense to divide the code, but why divide it into separate spider instances rather than separate classes inheriting from a common base spider? Do they get scheduled at different frequencies? Do they need a lot of parametrization through spider arguments?
I don't see how scrapyd could integrate a custom Scrapy downloader, but we are open to ideas.
(P.S. I changed the title to something more abstract, I hope you don't mind)
Closing as discussion received no further replies.
Hi, I encountered the same question. I want to solve it with RabbitMQ or Redis. Could you share your solution?
Hi, I am running multiple spiders concurrently, all of them scraping the same domain. I would like to be able to limit the download rate to this domain using the DOWNLOAD_DELAY scrapy setting.
The problem is that, after running some tests, I found that this setting only limits the rate of each spider separately. So if I run 3 spiders at the same time, I end up downloading 3 times faster than the DOWNLOAD_DELAY limit would suggest.
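The arithmetic behind that observation, with a hypothetical delay value:

```python
# Suppose DOWNLOAD_DELAY = 2.0 seconds; each spider enforces it alone.
download_delay = 2.0
per_spider_rate = 1 / download_delay   # 0.5 requests/second per spider

spiders = 3
aggregate_rate = spiders * per_spider_rate  # what the domain actually sees
```

Because the delay is tracked per process, the domain receives `spiders / DOWNLOAD_DELAY` requests per second rather than the intended `1 / DOWNLOAD_DELAY`.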
Is there a way to enforce a single download delay to one domain across all spiders?