scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

scrapy's download concurrency limits do not apply to parallel runs #221

Closed (joaqo closed this issue 3 years ago)

joaqo commented 7 years ago

Hi, I am running multiple spiders concurrently, all of them scraping the same domain. I would like to be able to limit the download rate to this domain using the DOWNLOAD_DELAY scrapy setting.

The problem is that, after running some tests, I found that this setting only limits the rate of each spider separately. So if I run 3 spiders at the same time, I end up downloading 3 times faster than my DOWNLOAD_DELAY setting suggests.

Is there a way to enforce a single download delay for one domain across all spiders?
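For context, a minimal sketch of the setting in question (values are illustrative). DOWNLOAD_DELAY is enforced per crawl process, so parallel processes each apply it independently:

```python
# settings.py -- illustrative values
DOWNLOAD_DELAY = 2.0  # seconds between requests, per crawl process

# With the default RANDOMIZE_DOWNLOAD_DELAY = True, the actual wait is
# jittered between 0.5x and 1.5x of DOWNLOAD_DELAY, still per process.
# Three processes running this configuration hit the target domain
# roughly three times as often as a single process would.
RANDOMIZE_DOWNLOAD_DELAY = True
```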

redapple commented 7 years ago

Interesting use case. It's true that each spider crawl runs in its own process and does not share state with the others. I'm not sure there's an easy way to make them share their downloader information. Maybe you could use a common proxy for all spiders, which would do the throttling.
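One way the proxy idea could look, as a sketch: the middleware name, priority value, and proxy address are placeholders, and the rate limiting itself would have to be configured in the proxy (for example, Squid delay pools). Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"]:

```python
# middlewares.py -- send every request from every spider through one
# shared forward proxy, which performs the actual throttling.
class CommonProxyMiddleware:
    def process_request(self, request, spider):
        # Placeholder address of the shared rate-limiting proxy.
        request.meta.setdefault("proxy", "http://localhost:3128")


# settings.py -- order 350 runs before the built-in
# HttpProxyMiddleware (750), which then picks up the meta key.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CommonProxyMiddleware": 350,
}
```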

Related question: why do you need to run multiple spiders concurrently against the same domain? Scrapy should be able to run more concurrent requests with a single spider, provided you tune the CONCURRENT_* settings.
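For reference, a sketch of scaling up a single crawl instead; these are standard Scrapy settings, with illustrative values:

```python
# settings.py -- one process means one shared delay and shared caps
CONCURRENT_REQUESTS = 32             # global cap (default 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain cap (default 8)
DOWNLOAD_DELAY = 0.5                 # applied once, not per spider
```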

joaqo commented 7 years ago

@redapple Thanks for the answer. The sites we are crawling are quite large, so it makes sense for us to divide our code into separate spiders that scrape different parts of the site.

hbbtstar commented 7 years ago

@joaqo I've encountered a similar problem, and we got around it by using a custom downloader that pulls and stores its throttling state in Redis instead of locally (it's not the greatest way to do this, but it has worked out for us so far).
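This is not hbbtstar's actual code, but a sketch of the general idea: a downloader middleware that coordinates through a short-lived per-domain Redis key, so all processes together send at most one request per delay window per domain. The class and setting names are made up, it assumes the redis-py package, and it relies on Scrapy waiting on a Deferred returned from process_request; the synchronous Redis call briefly blocks the reactor, which a production version would avoid (e.g. via deferToThread):

```python
# middlewares.py -- hypothetical cross-process throttle via Redis
from urllib.parse import urlparse

import redis
from twisted.internet import reactor
from twisted.internet.task import deferLater


class SharedDelayMiddleware:
    def __init__(self, redis_url, delay):
        self.redis = redis.Redis.from_url(redis_url)
        self.delay = delay

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            crawler.settings.get("SHARED_DELAY_REDIS_URL",
                                 "redis://localhost:6379/0"),
            crawler.settings.getfloat("DOWNLOAD_DELAY", 1.0),
        )

    def process_request(self, request, spider):
        key = "shared-delay:" + urlparse(request.url).netloc
        # SET NX PX claims the slot for exactly one process per window.
        if self.redis.set(key, 1, nx=True, px=int(self.delay * 1000)):
            return None  # slot claimed: let this request proceed
        # Slot held by another process: wait one period, then retry.
        return deferLater(reactor, self.delay,
                          self.process_request, request, spider)


# settings.py -- enable it for every spider in the project
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SharedDelayMiddleware": 543,
}
```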

Digenis commented 6 years ago

@joaqo, I understand that it makes sense to divide the code, but why divide it into separate spiders rather than separate classes that a single spider inherits from? Do they get scheduled at different frequencies? Do they need a lot of parametrization through spider arguments?
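A sketch of that layout (hypothetical names): the section-specific parsing lives in separate classes, but they are combined into one spider, so there is one process, one downloader, and one shared delay:

```python
# spiders/example.py -- one spider composed from per-section classes
import scrapy


class NewsSection:
    def parse_news(self, response):
        yield {"section": "news", "url": response.url}


class ShopSection:
    def parse_shop(self, response):
        yield {"section": "shop", "url": response.url}


class ExampleSpider(NewsSection, ShopSection, scrapy.Spider):
    name = "example"
    custom_settings = {"DOWNLOAD_DELAY": 2.0}
    start_urls = [
        "https://example.com/news",
        "https://example.com/shop",
    ]

    def parse(self, response):
        # Dispatch to the class that owns this part of the site.
        if "/news" in response.url:
            return self.parse_news(response)
        return self.parse_shop(response)
```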

I don't see how scrapyd could integrate a custom scrapy downloader, but we are open to ideas.

(P.S. I changed the title to something more abstract, I hope you don't mind)

jpmckinney commented 3 years ago

Closing, as the discussion received no further replies.

HairlessVillager commented 3 months ago

> @joaqo I've encountered a similar problem, and we got around it by using a custom downloader that pulls and stores its throttling state in Redis instead of locally (it's not the greatest way to do this, but it has worked out for us so far).

Hi, I've run into the same problem and would like to solve it with RabbitMQ or Redis. Could you share your solution?