scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

SQLite queue is using all CPU on high frequency poller (<1s) #475

Closed. pspsdev closed this issue 2 months ago.

pspsdev commented 1 year ago

When running spiders that do nothing at all, the SQLite-based poller uses all the CPU just reading scheduled tasks. It would be good to have plug-and-play alternative queues, like Redis.
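To illustrate why this adds up, here is a simplified, conceptual sketch of what a high-frequency poller does. This is not Scrapyd's actual implementation; the table name, schema, and project list are made up purely for illustration:

```python
import sqlite3
import time

# Conceptual sketch only: a poller that wakes up every POLL_INTERVAL seconds
# and queries a SQLite-backed queue for each project. With a sub-second
# interval and many projects, the queries alone burn CPU even though no
# spider is doing any work.
POLL_INTERVAL = 0.1  # analogous to a sub-second poll_interval
PROJECTS = [f"project_{i}" for i in range(20)]  # hypothetical project list


def poll_once(conn):
    # One query per project per tick; the "queue" table stands in for
    # whatever schema the real queue uses.
    for project in PROJECTS:
        conn.execute(
            "SELECT count(*) FROM queue WHERE project = ?", (project,)
        ).fetchone()


def main():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE queue (project TEXT, message TEXT)")
    while True:
        poll_once(conn)
        time.sleep(POLL_INTERVAL)


if __name__ == "__main__":
    main()
```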

pspsdev commented 1 year ago

Related: https://github.com/scrapy/scrapyd/issues/197

jpmckinney commented 1 year ago

Why are you running spiders that "do nothing at all"?

pspsdev commented 1 year ago

@jpmckinney just to rule out that the CPU is being used by a spider. This can be replicated by scheduling a lot of jobs with a polling rate below one second, e.g. 0.1. The SQLite queue will then use a massive amount of CPU.
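For context, the polling rate in question is Scrapyd's poll_interval setting (default 5.0 seconds). A scrapyd.conf sketch with the value used to trigger the problem:

```ini
[scrapyd]
# Default is 5.0 seconds; dropping it below 1 second is what triggers
# the heavy SQLite polling described in this issue.
poll_interval = 0.1
```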

pspsdev commented 1 year ago

There are also some unmaintained repos that try to solve this: https://github.com/speakol-ads/scrapyd-redis

Put simply, the SQLite queue is a really bad option for high-frequency polling.

jpmckinney commented 1 year ago

Hmm, yeah, same with https://github.com/Tiago-Lira/scrapyd-mongodb (from which scrapyd-redis is forked) and https://github.com/balena/python-pqueue (mentioned in #197).

https://github.com/peter-wangxu/persist-queue is still active, though maybe a first attempt is to switch to https://github.com/scrapy/queuelib as mentioned in #197.

Can you share your setup for reproducing the issue?

pspsdev commented 1 year ago

I will try to create a demo later, but it can pretty much be an empty Scrapyd service running with one spider that does nothing. Then create around 50 schedules per second and set the polling rate to 0.1. It will max out a powerful CPU.
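For anyone who wants to try this, here is a rough sketch of the scheduling side. It assumes a Scrapyd instance on localhost:6800 with a deployed project named "demo" and a no-op spider named "noop" (both names are made up), and poll_interval = 0.1 in scrapyd.conf:

```python
import time

import requests

SCRAPYD = "http://localhost:6800"
PROJECT = "demo"  # hypothetical project name
SPIDER = "noop"   # hypothetical spider that does nothing

# Push roughly 50 schedule requests per second at the schedule.json endpoint,
# then watch the CPU usage of the scrapyd process.
while True:
    for _ in range(50):
        requests.post(
            f"{SCRAPYD}/schedule.json",
            data={"project": PROJECT, "spider": SPIDER},
        )
    time.sleep(1)
```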

pspsdev commented 1 year ago

Also, in my personal opinion, it would make sense to add an interface for plugging in your own queue backend instead of relying on hacks like the two repos mentioned above.

pspsdev commented 1 year ago

Later on, SQLite could be switched out for some other default if needed, but having a simple way to replace the queue yourself would quickly solve this problem for those who use high-frequency polling.
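To make the idea concrete, here is a rough sketch of what a Redis-backed replacement queue could look like. It assumes the custom class must expose the same methods as Scrapyd's default SQLite spider queue (add, pop, count, list, remove, clear) and that Scrapyd passes the config and project name to the constructor; check ISpiderQueue and the queue setting added in #476 for the exact interface in your Scrapyd version:

```python
import json

import redis  # third-party dependency, not part of Scrapyd


class RedisSpiderQueue:
    """Rough sketch of a spider queue backed by a Redis sorted set.

    The method names mirror Scrapyd's SQLite-backed queue; the constructor
    arguments are an assumption about what Scrapyd passes in.
    """

    def __init__(self, config, project, redis_url="redis://localhost:6379/0"):
        self.key = f"scrapyd:queue:{project}"
        self.redis = redis.Redis.from_url(redis_url)

    def add(self, name, priority=0.0, **spider_args):
        message = dict(spider_args, name=name)
        # Negative score so that higher priority pops first with ZPOPMIN.
        self.redis.zadd(self.key, {json.dumps(message): -priority})

    def pop(self):
        popped = self.redis.zpopmin(self.key, count=1)
        if not popped:
            return None
        member, _score = popped[0]
        return json.loads(member)

    def count(self):
        return self.redis.zcard(self.key)

    def list(self):
        return [json.loads(m) for m in self.redis.zrange(self.key, 0, -1)]

    def remove(self, func):
        removed = 0
        for member in self.redis.zrange(self.key, 0, -1):
            if func(json.loads(member)):
                removed += self.redis.zrem(self.key, member)
        return removed

    def clear(self):
        self.redis.delete(self.key)
```

Unlike the SQLite queue, this keeps the scheduled jobs in a server process, so a sub-second poll only costs a few cheap Redis commands per tick rather than repeated database reads.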

jpmckinney commented 1 year ago

Do you have your own queue ready to use? You can try it with this PR: https://github.com/scrapy/scrapyd/pull/476

pspsdev commented 1 year ago

@jpmckinney thanks, give me a few hours I will try it out.

jpmckinney commented 1 year ago

@pspsdev Now that #476 is merged, do you have suggestions for how to edit the default spider queue, or should there just be a note in the documentation that it doesn't perform well under high frequency polling, and a custom queue would be better?

pspsdev commented 1 year ago

@jpmckinney I am still doing some tests on my end, give me a few days I will report with more details.

jpmckinney commented 4 weeks ago

FWIW, I can't replicate this issue. I set poll_interval = 0.1 and scheduled 100 jobs in a loop.