scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Is there a way to limit of how many spiders of a certain project would run on Scrapyd? #505

Closed aaronm137 closed 2 months ago

aaronm137 commented 2 months ago

Hello,

I run about 200 spiders every day at a certain time. With the default number of processes per CPU core (4) - I have two cores - these 200 spiders finished in about 7 hours. I wanted to speed this up and fully leverage the server, so I increased the number of processes per CPU core from 4 to 8, so in total I could run 16 processes (spiders) at the same time. My thinking was that if 8 concurrent processes can finish the job in 7 hours, 16 concurrent processes can do it in 3.5 hours.

How wrong I was... now, processing 16 tasks at the same time, it takes about 28 hours to complete the job. The reason is that the website I am scraping gives me a lot of timeouts and retries. So, apparently, I'll need to revert to 4 processes (maybe I can try 5-6 and see how the timeouts and retries look).

Anyway, that brings me to the main question here. I need to run two projects on this server - ProjectA (200 spiders) and ProjectB (50 spiders).

For ProjectA, I apparently need to limit the number of processes per core to 4 so that I don't overload the website I am scraping. But I also have ProjectB. One way to solve this is to run the spiders of ProjectB after those of ProjectA - say, ProjectA runs from midnight to 7am, and ProjectB runs from 7am.

But I don't like that solution much, as it is not very efficient, so I'm wondering whether there's a way to set Scrapyd to use 10 processes per CPU core, of which ProjectA would use at most 4 and ProjectB at most 6.

I hope what I am trying to achieve makes sense.

jpmckinney commented 2 months ago

Scrapy has a CONCURRENT_REQUESTS_PER_DOMAIN setting, but this is only useful when the requests originate from the same spider (i.e. if instead of 200 spiders for the same website, you had 1).
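
For reference, this is what the per-spider throttling knobs look like in a project's settings.py; a minimal sketch with illustrative values:

```python
# settings.py -- per-spider throttling; this only applies inside a single
# spider process, which is why it doesn't help across 200 separate spiders.
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # max parallel requests to any one domain
DOWNLOAD_DELAY = 0.5                # seconds between requests to the same slot (domain)
AUTOTHROTTLE_ENABLED = True         # optionally let Scrapy adapt the delay to server latency
```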

I think the simplest solution would be to use a reverse proxy like HAProxy that can throttle your requests to these remote websites. That way, you can precisely control the request rate, instead of gambling on how many requests get made for a given amount of CPU time. This also makes your project more portable if you move to a server with more cores, etc.
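
If you go that route, pointing the spiders at the proxy is a small change; a minimal sketch assuming the throttling proxy listens on 127.0.0.1:8888 (the address, port, and module path myproject.middlewares are placeholders):

```python
# middlewares.py -- route every request through the local throttling proxy.
# Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"], so a
# tiny downloader middleware is enough (setting the http_proxy environment
# variable before starting Scrapyd would also work).
class ThrottlingProxyMiddleware:
    def process_request(self, request, spider):
        request.meta.setdefault("proxy", "http://127.0.0.1:8888")


# settings.py -- run it before the built-in HttpProxyMiddleware (priority 750)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ThrottlingProxyMiddleware": 350,
}
```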

Scrapyd has max_proc_per_cpu (and max_proc) but this isn't configurable per project. That's because max_proc_per_cpu is intended to control CPU utilization (e.g. if your server also needs to run other processes); it is not intended as an indirect way to control the number of requests.

That said, you can run two instances of Scrapyd, each with different max_proc_per_cpu (or max_proc) settings, if that's really how you want to control the number of requests.
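
In that setup, each instance gets its own scrapyd.conf with its own port, directories, and process cap; a rough sketch (ports, paths, and numbers are placeholders):

```ini
# scrapyd-projecta.conf -- instance that only deploys ProjectA
[scrapyd]
http_port = 6800
max_proc  = 8            # e.g. 4 processes per core on a 2-core box; overrides max_proc_per_cpu
eggs_dir  = /var/lib/scrapyd-a/eggs
dbs_dir   = /var/lib/scrapyd-a/dbs
logs_dir  = /var/log/scrapyd-a
```

```ini
# scrapyd-projectb.conf -- second instance on another port, only for ProjectB
[scrapyd]
http_port = 6801
max_proc  = 12
eggs_dir  = /var/lib/scrapyd-b/eggs
dbs_dir   = /var/lib/scrapyd-b/dbs
logs_dir  = /var/log/scrapyd-b
```

You'd start each instance from a directory containing its own scrapyd.conf (the config file in the current working directory is one of the locations Scrapyd reads), and point each project's deploy target at the matching port.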

If you don't want to do either of those, then I suppose you can make all your spiders use a custom Scrapy extension that increments the number of running spiders (e.g. in a SQLite file) on the spider_opened signal, and then loops (e.g. using twisted.internet.reactor.callLater) until the number is low enough to proceed, and finally decrements it on the spider_closed signal. However, this is very complicated, so I recommend one of the two alternatives.
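
For completeness, here is a rough, untested sketch of what such an extension could look like, assuming a SQLite file at /tmp/spider_slots.db shared by all spider processes and a made-up MAX_RUNNING_SPIDERS setting (the path, setting name, class name, and 5-second poll interval are all placeholders):

```python
import sqlite3

from scrapy import signals
from scrapy.exceptions import NotConfigured
from twisted.internet import reactor

DB_PATH = "/tmp/spider_slots.db"  # placeholder: must be shared by all spider processes


class SpiderSlotLimiter:
    """Pause the crawl until fewer than MAX_RUNNING_SPIDERS spiders hold a slot."""

    def __init__(self, crawler, limit):
        self.crawler = crawler
        self.limit = limit
        self.acquired = False

    @classmethod
    def from_crawler(cls, crawler):
        limit = crawler.settings.getint("MAX_RUNNING_SPIDERS", 0)
        if not limit:
            raise NotConfigured
        ext = cls(crawler, limit)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def _connect(self):
        # Autocommit mode, so BEGIN IMMEDIATE below takes the write lock itself.
        conn = sqlite3.connect(DB_PATH, timeout=30, isolation_level=None)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS slots (id INTEGER PRIMARY KEY CHECK (id = 1), n INTEGER)"
        )
        conn.execute("INSERT OR IGNORE INTO slots (id, n) VALUES (1, 0)")
        return conn

    def _try_acquire(self):
        conn = self._connect()
        try:
            conn.execute("BEGIN IMMEDIATE")  # serialize against other spider processes
            n = conn.execute("SELECT n FROM slots WHERE id = 1").fetchone()[0]
            if n < self.limit:
                conn.execute("UPDATE slots SET n = n + 1 WHERE id = 1")
                conn.execute("COMMIT")
                return True
            conn.execute("ROLLBACK")
            return False
        finally:
            conn.close()

    def _release(self):
        conn = self._connect()
        try:
            conn.execute("BEGIN IMMEDIATE")
            conn.execute("UPDATE slots SET n = MAX(n - 1, 0) WHERE id = 1")
            conn.execute("COMMIT")
        finally:
            conn.close()

    def spider_opened(self, spider):
        self._poll(spider)

    def _poll(self, spider):
        if self._try_acquire():
            self.acquired = True
            self.crawler.engine.unpause()
        else:
            # No free slot: pause the engine and check again in 5 seconds.
            spider.logger.debug("No free spider slot; pausing")
            self.crawler.engine.pause()
            reactor.callLater(5, self._poll, spider)

    def spider_closed(self, spider):
        if self.acquired:
            self._release()
```

Each project would enable it via the EXTENSIONS setting and set its own MAX_RUNNING_SPIDERS value.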

jpmckinney commented 1 month ago

FWIW, I see #139 implemented a limit of processes per project, but the author abandoned that PR the same day and started the much more complicated #140, which involves a change to the Poller interface to give it access to the launcher (I'm not sure about that change).