scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Use a single spider queue, for deterministic priority across all projects #187

Open Digenis opened 8 years ago

Digenis commented 8 years ago

Scheduled jobs are not run in FIFO+priority order. Instead, there are multiple queues (one per project), which are themselves visited in a fixed order that is neither round-robin nor priority-aware, just the "arbitrary but constant" iteration order of a Python dict.

A scheduled job with the currently highest priority still has to wait for every job sitting in the queues that the dict order puts in front of its own queue.

The poller therefore shows a "preference" for some projects over others and lets them monopolize process slots.

The following code in poller.py polls projects in whatever order iteritems() uses, which is arbitrary but constant.

for p, q in self.queues.iteritems():
    c = yield maybeDeferred(q.count)
    if c:
        msg = yield maybeDeferred(q.pop)
        returnValue(self.dq.put(self._message(msg, p)))

This becomes obvious when multiple projects have several spiders queued and waiting for slots. The first (project, queue) pair returned by iteritems() that has queued spiders prevents any further projects from being polled. If its queue is constantly full, only this project will have its spiders run.
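To see the effect in isolation, here is a minimal standalone sketch of the same polling logic, using plain dicts and lists in place of scrapyd's queue objects (the project and spider names are made up for illustration):

queues = {
    'project_a': ['a1', 'a2', 'a3'],  # this project always has spiders pending
    'project_b': ['b1'],
    'project_c': ['c1'],
}

def poll_once(queues):
    # Mirrors the poller: take from the first non-empty queue in dict order.
    for project, queue in queues.items():
        if queue:
            return project, queue.pop(0)

for _ in range(3):
    print(poll_once(queues))
# ('project_a', 'a1'), ('project_a', 'a2'), ('project_a', 'a3')
# project_b and project_c are never reached until project_a's queue drains.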

Possible solutions:

One way to solve this is suggested by @luc4smoreira in #140. However, it can't be a correct solution (for this issue) because it makes total slot utilization vary a lot. E.g. with 5 slots per project and 10 projects, when only 1 project is active there will be just 5 slots utilized (even if that single active project needs more), but when all 10 are active there will be 50. As a workaround, one can set the total slot limit just above the slots per project multiplied by the number of projects expected to monopolize slots, so that a few slots always remain for the others; see the quick calculation below.
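The numbers above as a worked calculation (the per-project slot limit is what #140 proposes, not a stock scrapyd setting; the last two figures are hypothetical estimates):

slots_per_project = 5
projects = 10
print(1 * slots_per_project)         # 5 slots utilized with a single active project
print(projects * slots_per_project)  # 50 slots utilized when all projects are active

# Workaround: cap the total just above what the monopolizing projects can take,
# so a few slots always remain for the rest.
monopolizing_projects = 2            # hypothetical estimate
spare_slots_for_others = 3           # hypothetical estimate
total_slot_limit = monopolizing_projects * slots_per_project + spare_slots_for_others  # 13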

Another incomplete solution is to use the spider priority earlier in the poller, before picking a project. When I noticed this problem, I mistakenly tried lowering the priority of spiders belonging to the project monopolizing the slots. Of course it didn't work: the priority only applies within a project's queue, not across the set of queues. Update 2019-06-26: a for-loop in the poller that queries all queues will also make race conditions more frequent if more processes (or threads) write to and read from the queues.
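For reference, one reading of "use the spider priority earlier in the poller" is sketched below. It assumes a hypothetical peek_priority() method on each project queue (scrapyd's spider queues only expose methods like count(), pop() and add()), and it still has the race window mentioned in the update, between peeking and popping:

from twisted.internet.defer import inlineCallbacks, maybeDeferred, returnValue

@inlineCallbacks
def poll_highest_priority(queues, dq):
    # Pop the globally highest-priority message across all project queues.
    if dq.pending:
        return
    best = None
    for project, queue in queues.items():
        # peek_priority() is hypothetical: it would return the priority of the
        # head message, or None for an empty queue.
        priority = yield maybeDeferred(queue.peek_priority)
        if priority is not None and (best is None or priority > best[0]):
            best = (priority, project, queue)
    if best is not None:
        _, project, queue = best
        msg = (yield maybeDeferred(queue.pop)).copy()
        msg['_project'] = project
        msg['_spider'] = msg.pop('name')
        returnValue(dq.put(msg))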

We can also stop using multiple queues and keep everything in a single database. Although it sounds like a big change, doing it lets us drop a lot of code from scrapyd, and we can implement both of the suggestions above more cleanly, faster and with less code.
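A minimal sketch of the single-database direction, using the standard sqlite3 module; the table layout and the class and method names here are assumptions for illustration, not scrapyd's actual schema:

import json
import sqlite3

class SingleSpiderQueue(object):
    """One queue for every project: pop order is priority, then insertion (FIFO)."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " project TEXT, priority REAL, message TEXT)")

    def add(self, project, name, priority=0.0, **spider_args):
        message = json.dumps(dict(spider_args, name=name))
        with self.conn:
            self.conn.execute(
                "INSERT INTO queue (project, priority, message) VALUES (?, ?, ?)",
                (project, priority, message))

    def pop(self):
        # Highest priority wins; ties are broken by insertion order,
        # regardless of which project the job belongs to.
        row = self.conn.execute(
            "SELECT id, project, message FROM queue"
            " ORDER BY priority DESC, id ASC LIMIT 1").fetchone()
        if row is None:
            return None
        with self.conn:
            self.conn.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        return dict(json.loads(row[2]), _project=row[1])

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM queue").fetchone()[0]

With a layout like this, the poller no longer needs to iterate projects at all: a single pop() yields the globally next job, ordered by priority and then by scheduling order.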

Digenis commented 8 years ago

The easiest way to fix this is to pick the queue to poll at random instead of in iteritems() order, as in the following dirty workaround:

--- a/scrapyd/poller.py
+++ b/scrapyd/poller.py
@@ -1,3 +1,4 @@
+from random import sample
 from zope.interface import implements
 from twisted.internet.defer import DeferredQueue, inlineCallbacks, maybeDeferred, returnValue

@@ -17,11 +18,12 @@ class QueuePoller(object):
     def poll(self):
         if self.dq.pending:
             return
-        for p, q in self.queues.iteritems():
-            c = yield maybeDeferred(q.count)
-            if c:
-                msg = yield maybeDeferred(q.pop)
-                returnValue(self.dq.put(self._message(msg, p)))
+
+        for proj, queue in sample(self.queues.items(), len(self.queues)):
+            if (yield maybeDeferred(queue.count)):
+                msg = (yield maybeDeferred(queue.pop)).copy()
+                msg = dict(msg, _project=proj, _spider=msg.pop('name'))
+                returnValue(self.dq.put(msg))

     def next(self):
         return self.dq.get()

This is not the kind of fix that we'll release but I provide it here for reference and for users in need of a patch.
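For anyone needing the same workaround on Python 3, roughly the same poll() body should work once iteritems() is dropped and items() is materialized into a list for sample() (a sketch, untested against current scrapyd):

from random import sample
from twisted.internet.defer import inlineCallbacks, maybeDeferred, returnValue

@inlineCallbacks
def poll(self):
    if self.dq.pending:
        return
    # sample() needs a sequence; on Python 3, items() is only a view.
    for project, queue in sample(list(self.queues.items()), len(self.queues)):
        if (yield maybeDeferred(queue.count)):
            msg = (yield maybeDeferred(queue.pop)).copy()
            msg = dict(msg, _project=project, _spider=msg.pop('name'))
            returnValue(self.dq.put(msg))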

This also makes me wonder whether a queue can be deleted during the reactor loop iterations between sample() and returnValue(). If so, that's a race condition.

Tarliton commented 7 years ago

Hello

Moving away from multiple queues sounds good. What would be the requirements of an acceptable solution for this?

thanks

Digenis commented 7 years ago

Any solution is more or less backwards incompatible, so it can't make it into 1.2.

The requirements are: the poller and/or spider queue should return jobs in the same order in which they entered the queue(s) from the scheduler. For a single-database approach (which I prefer), one should also demonstrate that there are no concurrency issues.
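Spelled out as a (hypothetical) test, assuming some single-queue implementation with add()/pop() along the lines of the sketch in the first comment, the ordering discussed in this issue (priority first, then scheduling order across all projects) would look roughly like this:

def test_global_priority_then_fifo(queue):
    # Jobs from different projects, scheduled in this order:
    queue.add('project_a', 'spider1', priority=0)
    queue.add('project_b', 'spider2', priority=0)
    queue.add('project_a', 'spider3', priority=1)  # higher priority, scheduled last
    popped = [queue.pop() for _ in range(3)]
    # Highest priority first, then strict scheduling (FIFO) order across projects.
    assert [(m['_project'], m['name']) for m in popped] == [
        ('project_a', 'spider3'),
        ('project_a', 'spider1'),
        ('project_b', 'spider2'),
    ]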

Digenis commented 5 years ago

Similar: https://github.com/scrapy/scrapy/issues/3666 https://github.com/scrapy/scrapy/issues/3679

jpmckinney commented 3 months ago

The tests in #344 #349 might be useful.

Can also consider #343

Idea from https://github.com/scrapy/scrapyd/issues/197#issuecomment-270340046

sqlite priority queue [...] may also be eventually replaced by https://github.com/scrapy/queuelib

Edit: Removing priority milestone as I don’t know how many users actually use priorities. Also, the ideal way to do this is a breaking change.