I think we encountered this issue today. We are using bullmq 5.7.8 and ioredis 5.4.1, and today we noticed that for some reason no jobs were being processed. We first thought it might be a Redis issue. We have not done a thorough investigation yet, but after restarting the service the jobs were processed as expected, which means that adding jobs worked but the worker did not process them. This is the first time we have encountered this issue.
Aside from `maxRetriesPerRequest: null` we are using `lazyConnect: true`, and we are currently not setting `enableReadyCheck` at all (should we set it to false?).
So next time we have a similar issue and I notice the worker is not connected, what would the next steps be for troubleshooting and resolving it?
If you use the suggested settings, the workers should automatically reconnect as soon as they can, so you should not run into this issue anymore.
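For reference, here is a minimal sketch of the connection settings generally recommended for workers, assuming a local Redis on the default port (the queue name and host are placeholders, not the original poster's code):

```js
const { Worker } = require('bullmq');

// Minimal sketch, assuming a local Redis instance.
// maxRetriesPerRequest: null and enableReadyCheck: false are the settings
// BullMQ warns about when they are missing; everything else is illustrative.
const worker = new Worker(
  'my-queue',
  async (job) => {
    // process the job here
  },
  {
    connection: {
      host: 'localhost',
      port: 6379,
      maxRetriesPerRequest: null,
      enableReadyCheck: false,
    },
  },
);

worker.on('error', (err) => console.error('Worker error:', err));
```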
Is it really necessary to explicitly set `enableOfflineQueue: true` for the workers in order for them to reconnect? So far this has not been necessary.
You can use https://api.docs.bullmq.io/classes/v5.Queue.html#getWorkers or Taskforce.sh; I don't know if Bull Board can also show this information.
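As a quick check, something along these lines can be used to see which workers a queue currently recognizes (queue name and connection are placeholders):

```js
const { Queue } = require('bullmq');

// Hypothetical queue name and connection; adjust to your setup.
const queue = new Queue('my-queue', {
  connection: { host: 'localhost', port: 6379 },
});

async function checkWorkers() {
  // getWorkers() lists the Redis clients registered as workers for this queue;
  // getWorkersCount() returns just the number of them.
  const workers = await queue.getWorkers();
  const count = await queue.getWorkersCount();
  console.log(`Workers connected: ${count}`);
  console.log(workers);
}

checkWorkers().catch(console.error);
```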
I connected to our Q Redis instance and checked the workers and worker count of the deployed queue locally (since the queues of the Q stage also stopped processing jobs), and indeed the workers of our Q deployment are apparently no longer connected; the worker count for the queue is 0.
Apparently the managed Redis instances were restarted prior to the issue. About two hours later the queue accepted new jobs again, but the worker stopped processing jobs. (We do not know yet why the worker stopped; we use the same options, e.g. infinite retries, for both the queue and the worker.)
Hi, @manast I did some tests with the code you provided.
If I force close the Redis connections with e.g.

```bash
redis-cli CLIENT KILL TYPE normal && redis-cli CLIENT KILL TYPE slave && redis-cli CLIENT KILL TYPE pubsub
```

the worker reconnects and continues to process jobs as expected.
If, however, I kill/shut down the entire Redis server (`docker-compose down`), wait a few seconds (~5 seconds), and then start Redis again (`docker-compose up -d`), the worker reconnects but does not continue to process jobs from the queue (although `isRunning()` returns true). The queue then no longer sees any connected worker. (This simulates a server restart and/or crash.)
Is that the expected behaviour?
docker-compose.yaml:

```yaml
version: '3.8'
services:
  redis:
    container_name: 'redis_test_server'
    image: redis:6.2.14-alpine
    restart: always
    ports:
      - '6379:6379'
    volumes:
      - /tmp/redis-test-server:/data
```
(Setting `redis-cli config set maxmemory-policy noeviction` after the first start, of course.)
I modified your code slightly:
```js
const { Queue, Worker } = require('bullmq');

const queueName = 'test';

async function start() {
  const queue = new Queue(queueName, {
    connection: { host: 'localhost', port: 6379 },
    // a warning is thrown on redis startup if these aren't added
    enableReadyCheck: false,
    maxRetriesPerRequest: null,
    enableOfflineQueue: false,
  });

  setInterval(() => {
    queue.getWorkersCount().then((numberOfWorkers) => {
      console.warn(`Number of workers: ${numberOfWorkers}`);
    });
    queue.getJobCounts().then((numberOfJobs) => {
      console.warn(`Number of jobs: ${JSON.stringify(numberOfJobs)}`);
    });
  }, 10_000);

  const job = await queue.add('__default__', null, {
    jobId: queueName + '-cron-worker-job',
    repeat: {
      every: 3000, // every 3 seconds
    },
    data: {
      foo: 'bar',
    },
  });

  const processFn = async (job) => {
    console.log(`Processing job ${job.id} with data ${job.data}`);
    console.log(`-> ${job.id}`);
    await new Promise((res) => setTimeout(res, 1000));
    console.log(`\t<- ${job.id}`);
  };

  const worker = new Worker(queueName, processFn, {
    connection: {
      host: 'localhost',
      port: 6379, // a warning is thrown on redis startup if these aren't added
      enableReadyCheck: false,
      maxRetriesPerRequest: null,
      enableOfflineQueue: true,
    },
  });

  worker.on('error', (err) => {
    console.error(err);
  });

  worker.on('closed', () => {
    console.warn('Worker closed');
  });

  worker.on('ready', () => {
    console.warn('Worker is ready!');
  });

  worker.on('completed', (job) => {
    console.log(`Job ${job.id} completed`);
  });

  worker.on('failed', (job, err) => {
    console.error(`Job ${job.id} failed with ${err.message}`);
  });
}

start();
```
I can reproduce the above with bullMQ starting with version 5.0.0; bullMQ 4.17.0 works as expected.
@lukas-becker0 seems like I am able to reproduce it following your instructions. I will keep you updated...
I hope this small fix finally resolves this issue for everybody.
Hi @manast,
I'm sorry, but I'm still able to reproduce it with the fix and bullmq 5.7.13.
I assume it is enough to add the line from the fix PR to `worker.js` in `bullmq/dist/cjs/classes/worker.js`?
Sometimes the worker can connect again, but when I then restart Redis for a second or third time it eventually results in the same issue as before.
I also tried a custom `redis.conf` with the recommended AOF option set from the start, but it made no difference.
FYI @manast, I created a repo with my current test setup here, just in case it might be helpful.
It is not enough to add the line with the disconnect; you must remove the other two as well.
@manast

> It is not enough to add the line with the disconnect; you must remove the other two as well.
Sorry, you are right, I somehow missed that :see_no_evil: (due to the Dark Reader Firefox extension).
I just ran the tests again and it does indeed work now as expected. Thank you very much, and sorry for the confusion. :fireworks: :smiley:
Really glad this got fixed; I literally just ran into this issue today during POC testing, and upgrading to 5.7.14 did the trick. Thanks @manast
Is this already part of the latest pro version too or is that still coming?
@wernermorgenstern it's coming very soon.
Hi @wernermorgenstern, it's available since v7.8.2 in the Pro version.
We will attempt the update next week and report back.
Experiencing this in v5.8.1
Since it seems that the original authors can no longer reproduce the issue, I will close it now, so that other users who are experiencing a similar (but not this exact) issue do not get lured into this one, as that would just confuse everybody.
@tavindev you are welcome to open a new issue with the particular details for your use case.
@tavindev I used my test setup from last time and tested with v5.8.2, and I cannot reproduce it.
Version: v5.4.2
Platform: NodeJS

What happened?
We have a service where a worker runs and processes jobs. After processing is done, it creates another job, which is delayed (around 64 minutes). Today I noticed that the service and worker stopped processing jobs. There were no error messages in the logs. When I used Bull Board (I use it as a UI to see jobs), I saw the jobs were still in the delayed state and about 24 hours overdue.
When I restarted the service and the worker started, it immediately began processing those delayed jobs. This is not the first time it has happened; today, though, I first checked the delayed jobs.
In today's incident, the service had been running for 4 days.
We run in EKS on AWS (NodeJS service, using TypeScript). I use BullMQ Pro, and we are using groups, with each group's concurrency set to 1.
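For context, here is a minimal sketch of the pattern described above, where the processor schedules a delayed follow-up job once it finishes. The queue name, connection, and delay are illustrative, and the real setup uses BullMQ Pro with groups (per-group concurrency of 1) rather than the open-source classes shown here:

```js
const { Queue, Worker } = require('bullmq');

// Illustrative names and connection; the real service uses BullMQ Pro with groups.
const connection = { host: 'localhost', port: 6379, maxRetriesPerRequest: null };
const queue = new Queue('follow-up', { connection });

const worker = new Worker(
  'follow-up',
  async (job) => {
    // ... do the actual work here ...

    // When done, schedule the next run as a delayed job (~64 minutes later).
    await queue.add('next-run', job.data, {
      delay: 64 * 60 * 1000,
    });
  },
  { connection },
);

worker.on('failed', (job, err) => {
  console.error(`Job ${job?.id} failed: ${err.message}`);
});
```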
How to reproduce: I don't have any test code for this.