Open · jimvandervoort opened this issue 2 months ago
As 100 is not a very large number of jobs, I wonder if the data sent per job is very large and is causing the Lua script to run too slowly. Also, have you tried decreasing the batch size to see if that is what is actually causing the issue?
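For reference, a minimal sketch of where the batch size is tuned on a BullMQ Pro worker, assuming the Pro batches API (the batch.size option); the queue name, connection details and processing logic are placeholders:

```ts
import { WorkerPro } from '@taskforcesh/bullmq-pro';

const worker = new WorkerPro(
  'my-queue',
  async (job) => {
    // With batches enabled, the processor receives a batch job and the
    // individual jobs can be iterated via getBatch().
    for (const child of job.getBatch()) {
      // handle child.data here
    }
  },
  {
    connection: { host: 'localhost', port: 6379 },
    batch: { size: 10 }, // lower this to test whether smaller batches help
  },
);
```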
Hi and thank you for the reply :)
I will try running with a different batch size, but I log the size of the batches picked up by workers, and going through the results I can see they rarely accumulate more than 10 jobs.
The payload is a JSON document whose length hardly varies; as a minified string it is no more than 1K characters.
It might also be good to note that about half of the jobs have a delay of 31000 ms.
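For context, that delay is set per job when it is added, roughly like this (queue name, payload and connection here are just placeholders):

```ts
import { Queue } from 'bullmq';

const queue = new Queue('events', { connection: { host: 'localhost', port: 6379 } });

// Delayed jobs sit in a separate sorted set until the delay expires and are
// then promoted to the wait list before a worker can pick them up.
await queue.add('event', { some: 'payload' }, { delay: 31000 });
```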
Are you able to monitor the Redis instance's CPU usage? This will help us see whether there are problematic spikes that we need to take care of.
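redis-cli INFO cpu (or redis-cli --stat) shows this directly; if it is easier to watch from Node, a minimal sketch with ioredis could look like this (host and port are assumptions):

```ts
import Redis from 'ioredis';

const redis = new Redis({ host: 'localhost', port: 6379 });

// used_cpu_sys / used_cpu_user are cumulative counters, so the delta between
// samples shows how busy the Redis process actually is.
setInterval(async () => {
  const cpuSection = await redis.info('cpu');
  console.log(new Date().toISOString(), cpuSection);
}, 5000);
```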
I was monitoring the total CPU usage of the server earlier, but after your comment I just realized Redis is single-threaded, and I can see it is maxing out a single core. I think this means we either need better per-core performance or we need to run multiple Redis instances on this server.
Let me know if what I'm saying doesn't make sense.
For the record, I just ran a load test with a batch size of 1 and got the same BUSY errors as earlier, at around the same point.
Yes, that's definitely a problem; it should never go above 80%. Multiple cores will not help you in this case unless you also divide the jobs into different queues, e.g. one queue per available Redis node. Alternatively, you can upgrade to a CPU with better single-core performance. If you plan to use a cluster you should read this: https://docs.bullmq.io/bull/patterns/redis-cluster
There is also more info on the DragonflyDB site; using the same key-slot technique you can take advantage of multi-core CPUs: https://www.dragonflydb.io/docs/integrations/bullmq#2-queue-naming-strategies
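The common idea behind both links is to put a hash tag in the queue prefix so that all of a queue's keys map to the same hash slot, while different queues can land on different nodes or threads. A minimal sketch, with placeholder names and connection details:

```ts
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// The {} around the prefix makes it a hash tag, so every key belonging to this
// queue maps to the same slot and can be pinned to one shard/thread.
const ordersQueue = new Queue('orders', { connection, prefix: '{orders}' });

const ordersWorker = new Worker(
  'orders',
  async (job) => {
    // process job.data
  },
  { connection, prefix: '{orders}' },
);
```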
Btw, how many jobs are you processing per second? The reason I ask is that, out of the box, Redis can handle quite a lot unless it is a very weak machine.
We are using 2 machines, each running a 56-core Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz.
We need to be able to safely process 4K jobs per second.
I just ran a test where we set up 2 Redis instances per server, plus 2 instances of our worker app and our ingestion API per server. The ingestion API places jobs in the queue from an HTTP request and the worker picks up jobs. Each worker and ingestion API is connected to a different Redis instance.
That allows us to process at least 8K jobs/s; somewhere above that we get the BUSY error again (which makes sense, because we previously managed around 4K/s with 1 Redis instance per server).
It's actually easy for us to split up our queue into multiple instances and connect them to different Redis instances. Now it's up to us to decide whether we want to run the ingestion API/worker/Redis separately or not. I do like the idea of running a Redis instance per server so we do not have a single point of failure. And because Redis only uses a single core on a 56-core machine, there is enough processing room left for at least the ingestion API and possibly the workers too.
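Not the actual setup described above, just a minimal sketch of one way to split work across two Redis instances: hash a stable key to pick one of N queues, each backed by its own connection (host names, the queue name and the shard key are all placeholders):

```ts
import { Queue } from 'bullmq';
import { createHash } from 'node:crypto';

// One Queue per Redis instance; each points at a queue with the same name.
const shards = [
  new Queue('events', { connection: { host: 'redis-a', port: 6379 } }),
  new Queue('events', { connection: { host: 'redis-b', port: 6379 } }),
];

// Deterministically map a key to a shard so related jobs always land together.
function pickShard(key: string): Queue {
  const digest = createHash('md5').update(key).digest();
  return shards[digest.readUInt32BE(0) % shards.length];
}

// In the ingestion API: enqueue against the shard chosen for this key.
async function enqueueEvent(deviceId: string, payload: object) {
  await pickShard(deviceId).add('event', payload);
}
```

Each worker process would then connect to exactly one of the Redis instances and consume only that shard's queue.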
Since you are hosting your own Redis instances, have you tried Dragonfly? With 56 cores it seems you have enough power to process orders of magnitude more than your requirement.
It sounds strange to me that 4K jobs per second would saturate a standard Redis, though; that is not a large amount on that hardware. Any chance you could create an example case that reproduces this saturation, so that I can run it on my own and investigate it in more detail?
@jimvandervoort let me know how this issue is going.
@manast just read your reply. Thank you for the great help so far! Yep, we were looking into Dragonfly. For now we have a few more components we need to test and 8K is enough for us, so in order to meet our deadlines we'll run a few more tests to see what other bottlenecks we can find, then we'll get to work on improving throughput.
I'd like to create a minimal reproduction example, great idea. Will get back to you when I have one this week or next :)
Btw, you can also use this simple repo to test performance: https://github.com/taskforcesh/bullmq-bench. For example, on my local machine (M2 Pro) I get 46K jobs added per second using bulk adds, and 30K+ processed jobs per second with a worker concurrency factor of 100.
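The two knobs mentioned there (bulk adds and the worker concurrency factor) look roughly like this in plain BullMQ; the queue name, payload and connection are placeholders:

```ts
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

async function main() {
  const queue = new Queue('bench', { connection });

  // Bulk add: one round trip for many jobs instead of one command per job.
  await queue.addBulk(
    Array.from({ length: 1000 }, (_, i) => ({ name: 'bench-job', data: { i } })),
  );

  // A high concurrency factor lets a single worker keep many jobs in flight.
  const worker = new Worker('bench', async (job) => job.data, {
    connection,
    concurrency: 100,
  });
}

main().catch(console.error);
```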
I’m encountering an issue with BullMQ Pro where both the worker and server are affected by the
BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE
error. I'm running 10 Node.js workers and 1 server. Workers are started with:
Expected Behavior: I expected that this error would only cause the worker to fail temporarily without impacting the server’s ability to place events in the queue.
Actual Behavior: When the error occurs, the server also fails to place events into the queue, causing us to miss important events.
Questions:
Any help would be greatly appreciated!