taskforcesh / bullmq

BullMQ - Message Queue and Batch processing for NodeJS and Python based on Redis
https://bullmq.io
MIT License

[Bug]: Queue events does not always capture all completed events #2517

Open bobthekingofegypt opened 1 month ago

bobthekingofegypt commented 1 month ago

Version

v5.7.1

Platform

NodeJS

What happened?

We recently updated from the old bull to bullmq in a legacy project.
With the old bull we did some unusual wrapping and monitoring of this legacy project from our newer orchestration tool: the tool used the completed callbacks from the queues to understand when this legacy stage had finished processing everything. That always worked fine. After switching to bullmq we have noticed that the orchestrator no longer detects that the queue work is complete. BullMQ has processed all the jobs fine and they have been saved off to our database without problems, but the orchestrator process keeps waiting for a few more callbacks so that the completed callback count equals the submitted job count. Those callbacks never arrive. We are not sure whether we have done something wrong or whether this is expected behaviour.

In summary: all jobs are processed and saved successfully, but the completed callbacks for a few of them never arrive, so the completed count never reaches the submitted count.

How to reproduce.

https://github.com/bobthekingofegypt/check_bull_complete_count

I uploaded this repo as a minimal test case. It contains a monitor, a consumer and a producer. The monitor listens for completed events, node:cluster starts a number of worker processes, and the producer submits one million jobs. But the monitor doesn't always register one million completed callbacks. I'm testing this on a 20-core machine.
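For reference, a minimal sketch of a monitor along those lines; the queue name, connection details and expected count below are assumptions for illustration, not taken from the linked repo:

```typescript
import { QueueEvents } from 'bullmq';

// Assumed queue name and connection; the linked repo may use different values.
const queueEvents = new QueueEvents('legacy-stage', {
  connection: { host: 'localhost', port: 6379 },
});

const expected = 1_000_000;
let completed = 0;

// Count every 'completed' event delivered to this listener.
queueEvents.on('completed', ({ jobId }) => {
  completed++;
  if (completed === expected) {
    console.log('all submitted jobs reported as completed');
  }
});

// Log progress periodically so missing events become visible.
setInterval(() => {
  console.log(`completed events received: ${completed}/${expected}`);
}, 5_000);
```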

Relevant log output

No response

manast commented 1 month ago

I think it is possible this has to do with the max events length. Since you are processing very quickly, the Redis stream holding the events may get trimmed before the QueueEvents instance manages to read them. You can try increasing this setting to a larger value to see if it helps: https://api.docs.bullmq.io/interfaces/v5.QueueOptions.html#streams. The default is 10k; you could try 100k instead.
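A minimal sketch of raising that limit via the queue options (queue name and connection details are placeholders):

```typescript
import { Queue } from 'bullmq';

// Raise the maximum length of the events stream from the 10k default
// so completed events survive longer before being trimmed.
const queue = new Queue('legacy-stage', {
  connection: { host: 'localhost', port: 6379 },
  streams: {
    events: {
      maxLen: 100_000,
    },
  },
});
```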

bobthekingofegypt commented 1 month ago

Tried it with 100k, sadly no difference.

I originally had the reproducible test case running with random sleeps to better match our production machines' throughput, but when I saw the same issue without them I removed them for simplicity. Our production machines don't consume events especially quickly; the completion monitor is attached to the final queue in a chain of processors. That queue's job is to save to Postgres, so its throughput isn't very high.
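For context, a processor with a random sleep like the one described might look roughly like this (a sketch; the queue name, connection and delay range are assumptions standing in for the Postgres save):

```typescript
import { Worker } from 'bullmq';

// Worker that simulates production-like throughput with a random delay,
// standing in for the slow database save in the real pipeline.
const worker = new Worker(
  'legacy-stage',
  async (job) => {
    const delayMs = 10 + Math.random() * 90;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    return job.data;
  },
  { connection: { host: 'localhost', port: 6379 } },
);
```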