taskforcesh / bullmq-pro-support

Support repository for BullMQ Pro edition.

Redis BUSY Error When Using BullMQ Pro with Multiple Worker Processes #79

Open jimvandervoort opened 2 months ago

jimvandervoort commented 2 months ago

I'm encountering an issue with BullMQ Pro where both the worker and the server are affected by the Redis error "BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE."

I'm running 10 NodeJS workers and 1 server. Workers are started with:

const workerOptions = {
  concurrency: 25,
  batch: {
    size: 100,
  },
};
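
Roughly, the worker is created like this (a simplified sketch; the queue name, connection details and handleEvent processor are placeholders, not our exact code):

const { WorkerPro } = require('@taskforcesh/bullmq-pro');

const worker = new WorkerPro(
  'events', // placeholder queue name
  async (job) => {
    // With the batch option, the processor receives a single job whose
    // getBatch() returns all jobs grouped into the current batch.
    for (const batchedJob of job.getBatch()) {
      await handleEvent(batchedJob.data); // placeholder processor
    }
  },
  { connection: { host: 'localhost', port: 6379 }, ...workerOptions }, // placeholder connection
);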

Expected Behavior: I expected that this error would only cause the worker to fail temporarily without impacting the server’s ability to place events in the queue.

Actual Behavior: When the error occurs, the server also fails to place events into the queue, causing us to miss important events.

Questions:

  1. Is there a way to isolate the worker failure so that it does not affect the server’s ability to enqueue events?
  2. Any config suggestions?
  3. Can you help me debug which Lua script is running at a given time so I can investigate the script further?

Any help would be greatly appreciated!

manast commented 2 months ago

Since 100 is not a very large number of jobs, I wonder if the data sent per job is very large and that is causing the Lua script to run too slowly. Also, have you tried decreasing the batch size to see if that is what is actually causing the issue?

jimvandervoort commented 2 months ago

Hi and thank you for the reply :)

I will try running with a different batch size, but I log the size of the batches picked up by the workers, and going through the results I can see they rarely accumulate more than 10 jobs.

[Screenshot: worker logs showing batch sizes rarely above 10 jobs (2024-09-27)]

The payload is a JSON document whose length does not vary much at all; as a minified string it is no more than 1K characters.

jimvandervoort commented 2 months ago

Might be good to note that about half the jobs have a delay set of 31000 ms.

manast commented 2 months ago

Is it possible for you to monitor the Redis instance's CPU usage? This will help us see if there are problematic spikes that we need to take care of.
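
For example, a quick sketch (assuming ioredis, which BullMQ uses under the hood; host and port are placeholders): poll the CPU section of INFO while the load test runs and watch for jumps.

const IORedis = require('ioredis');

const redis = new IORedis({ host: 'localhost', port: 6379 }); // placeholder connection

// Dump the CPU section of INFO every few seconds to spot spikes.
setInterval(async () => {
  console.log(new Date().toISOString());
  console.log(await redis.info('cpu'));
}, 5000);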

jimvandervoort commented 2 months ago

I was monitoring the total CPU usage of the server earlier, but after your comment I realized Redis is single-threaded, and I can see it is maxing out a single core. I think this means we need either better per-core performance or multiple Redis instances on this server.

Let me know if what I'm saying doesn't make sense.

[Screenshot: Redis maxing out a single CPU core (2024-09-27)]

For the record, I just ran a load test with a batch size of 1 and got the same BUSY errors as earlier, around the same time.

manast commented 2 months ago

Yes, that's definitely a problem; it should never go above 80%. Multicore will not help you in this case unless you also divide the jobs into different queues, for example one queue per available Redis node. But you can also move to a CPU with better single-core performance. If you plan to use Redis Cluster you should read this info: https://docs.bullmq.io/bull/patterns/redis-cluster

There is also more info on the DragonflyDB site; using the same key-slot technique you can take advantage of multicore CPUs: https://www.dragonflydb.io/docs/integrations/bullmq#2-queue-naming-strategies
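
Just to illustrate the key-slot idea (a rough sketch, not a drop-in config; queue names, prefixes and connection are placeholders): adding a hash tag to the queue prefix keeps all of a queue's keys in the same slot, so each queue can live on a different Redis Cluster node or Dragonfly thread.

const { QueuePro } = require('@taskforcesh/bullmq-pro');

// The {hash tag} in the prefix pins all keys of a queue to one key slot.
const ordersQueue = new QueuePro('orders', {
  prefix: '{orders}',
  connection: { host: 'localhost', port: 6379 }, // placeholder
});

const eventsQueue = new QueuePro('events', {
  prefix: '{events}',
  connection: { host: 'localhost', port: 6379 }, // placeholder
});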

manast commented 2 months ago

Btw, how many jobs are you processing per second? The reason I am asking is that, out of the box, Redis can handle quite a lot unless it is a very weak machine.

jimvandervoort commented 2 months ago

We are using 2 machines, each running a 56-core Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz.

We need to be able to safely process 4K jobs per second.

I just ran a test where we set up 2 Redis instances per server, plus 2 instances of our worker app and of our ingestion API per server. The ingestion API places jobs in the queue from an HTTP request and the worker picks up jobs. Each worker and ingestion API is connected to a different Redis instance.

That allows us to process at least 8K jobs/s; somewhere above that we get the BUSY error again (which makes sense, because we previously managed around 4K/s with 1 Redis instance per server).

It's actually easy for us to split our queue into multiple instances and connect them to different Redis instances; a rough sketch of the split is below. Now it's up to us to decide whether we want to run the ingestion API/worker/Redis separately or not. I do like the idea of running a Redis instance per server so we do not have a single point of failure, and because Redis only uses a single core on a 56-core machine there is enough processing room left for at least the ingestion API and possibly the workers too.
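
For reference, the split looks roughly like this (a sketch with placeholder hosts and queue names; pickQueue is a stand-in for our real routing logic):

const { QueuePro } = require('@taskforcesh/bullmq-pro');

// One queue instance per Redis instance; hosts are placeholders.
const queues = [
  new QueuePro('events', { connection: { host: 'redis-1', port: 6379 } }),
  new QueuePro('events', { connection: { host: 'redis-2', port: 6379 } }),
];

// Pick a shard by hashing a stable id so related events stay on the same instance.
function pickQueue(id) {
  let hash = 0;
  for (const char of String(id)) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0;
  }
  return queues[hash % queues.length];
}

// In the ingestion API handler (placeholder payload):
// await pickQueue(event.userId).add('event', event);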

manast commented 2 months ago

Since you are hosting your own Redis instances, have you tried Dragonfly? With 56 cores it seems you have enough power to process orders of magnitude more than your requirement.

It sounds strange to me that 4k jobs per second would saturate a standard Redis, though; that is not a large amount on that hardware. Any chance you could create an example case that reproduces this saturation, so that I can run it myself and investigate it in more detail?

manast commented 2 months ago

@jimvandervoort let me know how this issue is going.

jimvandervoort commented 2 months ago

@manast just read your reply. Thank you for the great help so far! Yep, we were looking into Dragonfly. For now we have a few more components we need to test and 8K is enough for us, so in order to meet our deadlines we'll run a few more tests to see what other bottlenecks we can find, then we'll get to work improving throughput.

I'd like to create a minimal reproduction example, great idea. Will get back to you when I have one this week or next :)

manast commented 2 months ago

Btw, you can also use this simple repo to test performance: https://github.com/taskforcesh/bullmq-bench. For example, on my local machine (M2 Pro) I get 46k jobs added per second using bulk adds, and 30k+ jobs processed per second with a worker concurrency factor of 100.
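
The "added per second" number comes from adding jobs in bulk, something along these lines (simplified; queue name, payload and connection are placeholders):

const { Queue } = require('bullmq');

const queue = new Queue('bench', { connection: { host: 'localhost', port: 6379 } });

// Adding jobs in chunks with addBulk amortizes the round trip per job.
async function addMany(total, chunkSize = 1000) {
  for (let i = 0; i < total; i += chunkSize) {
    const jobs = [];
    for (let j = i; j < Math.min(i + chunkSize, total); j++) {
      jobs.push({ name: 'bench-job', data: { index: j } });
    }
    await queue.addBulk(jobs);
  }
}

addMany(100000).then(() => queue.close());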