taskforcesh / bullmq

BullMQ - Message Queue and Batch processing for NodeJS and Python based on Redis
https://bullmq.io
MIT License

[Bug]: Worker stopped processing jobs, and mostly delayed Jobs #2466

Closed: wernermorgenstern closed this issue 4 months ago

wernermorgenstern commented 8 months ago

Version

v5.4.2

Platform

NodeJS

What happened?

We have a service where a worker runs and processes jobs. After the processing is done, it creates another job, which is delayed (by around 64 minutes). Today, I noticed that the service and worker stopped processing jobs. There were no error messages in the logs. When I used BullBoard (I use it as a UI to see jobs), I saw that the jobs were still in the delayed state, and about 24 hours overdue.
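For context, the follow-up pattern looks roughly like this (a simplified sketch, not our actual service code; queue and job names are placeholders):

```js
const { Queue, Worker } = require('bullmq');

const connection = { host: 'localhost', port: 6379, maxRetriesPerRequest: null };
const queue = new Queue('refresh', { connection });

const worker = new Worker(
    'refresh',
    async (job) => {
        // ... do the actual processing here ...

        // When done, enqueue the next job, delayed by ~64 minutes.
        await queue.add('refresh', job.data, { delay: 64 * 60 * 1000 });
    },
    { connection },
);
```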

When I restarted the service and the worker started, it immediately started processing those delayed jobs. This is not the first time it has happened. Today, though, I first checked the delayed jobs.

In today's incident, the service has been running for 4 days.

We run in EKS on AWS (a NodeJS service, using TypeScript). I use BullMQ Pro, and we are using Groups, with each Group's concurrency set to 1.
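(For reference, group concurrency with BullMQ Pro is configured roughly like this; a hedged sketch with made-up queue and group names, not the actual service code:)

```js
const { QueuePro, WorkerPro } = require('@taskforcesh/bullmq-pro');

const connection = { host: 'localhost', port: 6379, maxRetriesPerRequest: null };

async function main() {
    const queue = new QueuePro('devices', { connection });

    // Jobs are assigned to a group via their options...
    await queue.add('poll', { deviceId: 'abc' }, { group: { id: 'abc' } });

    // ...and the worker processes at most one job per group at a time.
    const worker = new WorkerPro(
        'devices',
        async (job) => {
            /* process the job */
        },
        { connection, group: { concurrency: 1 } },
    );
}

main();
```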

How to reproduce.

I don't have any test code for this.

Relevant log output

No logs or error logs were produced.


lukas-becker0 commented 5 months ago

I think we encountered this issue today.

We are using bullmq 5.7.8 and ioredis 5.4.1, and today we noticed that for some reason no jobs were being processed.

We first thought that it might be a Redis issue. We have not done a thorough investigation yet, but after restarting the service we noticed that the jobs were processed as expected, which means that adding the jobs worked but the worker did not process them.

This is the first time we encountered this issue.

Aside from maxRetriesPerRequest: null, we are using lazyConnect: true, and we are currently not setting enableReadyCheck at all (should we set it to false?).
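Roughly, our worker is created like this (a simplified sketch; host, port, and queue name are placeholders):

```js
const { Worker } = require('bullmq');

const worker = new Worker(
    'our-queue',
    async (job) => {
        /* process the job */
    },
    {
        connection: {
            host: 'localhost',
            port: 6379,
            maxRetriesPerRequest: null, // required by BullMQ workers
            lazyConnect: true,
            // enableReadyCheck is left unset (ioredis default)
        },
    },
);
```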

lukas-becker0 commented 5 months ago

> So next time we have a similar issue, and I notice the worker is not connected, what would the next steps be in troubleshooting and resolving this issue?

> If you use the suggested settings, the workers should automatically reconnect as soon as they can, so you should not get this issue anymore.

Is it really necessary to explicitly set enableOfflineQueue: true for the workers in order for them to reconnect? So far this was not necessary.

> You can use https://api.docs.bullmq.io/classes/v5.Queue.html#getWorkers or Taskforce.sh; I don't know if BullBoard can also show this information.

I connected to our Q Redis instance and checked the workers and worker count of the deployed queue locally (since the queues of the Q stage also stopped processing jobs), and indeed the workers of our Q deployment are apparently no longer connected; the worker count for the queue is 0.
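For anyone else hitting this, the check itself is simple (a minimal sketch; queue name and connection details are placeholders):

```js
const { Queue } = require('bullmq');

async function checkWorkers() {
    const queue = new Queue('our-queue', {
        connection: { host: 'localhost', port: 6379 },
    });

    // getWorkers() lists the Redis clients currently registered as
    // workers for this queue; an empty list means none are connected.
    const workers = await queue.getWorkers();
    console.log(`Connected workers: ${workers.length}`);

    await queue.close();
}

checkWorkers();
```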

lukas-becker0 commented 5 months ago

> I connected to our Q Redis instance and checked the workers and worker count of the deployed queue locally (since the queues of the Q stage also stopped processing jobs), and indeed the workers of our Q deployment are apparently no longer connected; the worker count for the queue is 0.

Apparently the managed Redis instances were restarted prior to the issue. About two hours later the queue did accept new jobs again; the worker, however, stopped processing jobs. (We do not know yet why the worker stopped; we use the same options, e.g. infinite retries, for both the queue and the worker.)

lukas-becker0 commented 5 months ago

Hi @manast, I did some tests with the code you provided.

If I force-close the Redis connections with e.g. `redis-cli CLIENT KILL TYPE normal && redis-cli CLIENT KILL TYPE slave && redis-cli CLIENT KILL TYPE pubsub`, the worker reconnects and continues to process jobs as expected.

If, however, I kill/shut down the entire Redis server (`docker-compose down`), wait a few seconds (~5 seconds), and then start Redis again (`docker-compose up -d`), the worker reconnects but does not continue to process jobs from the queue (although `isRunning` returns true). The queue then no longer sees any connected worker.

(Simulating a server restart and/or crash)

Is that the expected behaviour?

docker-compose.yaml:

```yaml
version: '3.8'

services:
    redis:
        container_name: 'redis_test_server'
        image: redis:6.2.14-alpine
        restart: always
        ports:
            - '6379:6379'
        volumes:
            - /tmp/redis-test-server:/data
```
(Setting `redis-cli config set maxmemory-policy noeviction` after the first start, of course.)
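(The same setting can also be applied from Node via ioredis; a one-off sketch, assuming a local Redis:)

```js
const Redis = require('ioredis');

// Disable eviction, as recommended for BullMQ, then disconnect.
const redis = new Redis({ host: 'localhost', port: 6379 });
redis
    .config('SET', 'maxmemory-policy', 'noeviction')
    .then(() => redis.quit());
```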

I modified your code slightly:

```js
const { Queue, Worker } = require('bullmq');

const queueName = 'test';

async function start() {
    const queue = new Queue(queueName, {
        // a warning is thrown on redis startup if these aren't added;
        // they are ioredis options, so they belong inside `connection`
        connection: {
            host: 'localhost',
            port: 6379,
            enableReadyCheck: false,
            maxRetriesPerRequest: null,
            enableOfflineQueue: false,
        },
    });

    setInterval(() => {
        queue.getWorkersCount().then((numberOfWorkers) => {
            console.warn(`Number of workers: ${numberOfWorkers}`);
        });

        queue.getJobCounts().then((numberOfJobs) => {
            console.warn(`Number of jobs: ${JSON.stringify(numberOfJobs)}`);
        });
    }, 10_000);

    await queue.add(
        '__default__',
        { foo: 'bar' }, // the job payload goes in the second argument of add()
        {
            jobId: queueName + '-cron-worker-job',
            repeat: {
                every: 3000, // every 3 seconds
            },
        },
    );

    const processFn = async (job) => {
        console.log(`Processing job ${job.id} with data ${JSON.stringify(job.data)}`);
        console.log(`-> ${job.id}`);
        await new Promise((res) => setTimeout(res, 1000));
        console.log(`\t<- ${job.id}`);
    };

    const worker = new Worker(queueName, processFn, {
        connection: {
            host: 'localhost',
            port: 6379, // a warning is thrown on redis startup if these aren't added
            enableReadyCheck: false,
            maxRetriesPerRequest: null,
            enableOfflineQueue: true,
        },
    });

    worker.on('error', (err) => {
        console.error(err);
    });

    worker.on('closed', () => {
        console.warn('Worker closed');
    });

    worker.on('ready', () => {
        console.warn('Worker is ready!');
    });

    worker.on('completed', (job) => {
        console.log(`Job ${job.id} completed`);
    });

    worker.on('failed', (job, err) => {
        console.error(`Job ${job.id} failed with ${err.message}`);
    });
}

start();
```
lukas-becker0 commented 5 months ago

I can reproduce the above with BullMQ starting from version 5.0.0; BullMQ 4.17.0 works as expected.

manast commented 5 months ago

@lukas-becker0 seems like I am able to reproduce it following your instructions. I will keep you updated...

manast commented 5 months ago

I hope this small fix finally resolves this issue for everybody.

lukas-becker0 commented 5 months ago

Hi @manast,

I'm sorry, but I'm still able to reproduce it with the fix and bullmq 5.7.13.

I am assuming it is enough to add the line from the fix PR to worker.js in bullmq/dist/cjs/classes/worker.js?


Sometimes the worker can connect again, but when I then restart Redis a second or third time, it eventually results in the same issue as before.


I also tried a custom redis.conf with the recommended AOF option set from the start, but it made no difference.

lukas-becker0 commented 5 months ago

FYI @manast, I created a repo with my current test setup here, just in case it might be helpful.

manast commented 5 months ago

It is not enough to add the line with the disconnect; you must remove the other two as well.

lukas-becker0 commented 5 months ago

@manast

> It is not enough to add the line with the disconnect; you must remove the other two as well.

Sorry, you are right, I somehow missed that :see_no_evil: (due to the Dark Reader Firefox extension...).

I just ran the tests again, and it does indeed work now as expected. Thank you very much, and sorry for the confusion. :fireworks: :smiley:

croconut commented 5 months ago

Really glad this got fixed; I literally just ran into this issue today during POC testing, and upgrading to 5.7.14 did the trick. Thanks @manast

wernermorgenstern commented 5 months ago

Is this already part of the latest Pro version too, or is that still coming?


manast commented 5 months ago

@wernermorgenstern it's coming very soon.

roggervalf commented 5 months ago

Hi @wernermorgenstern, it's available since v7.8.2 in the Pro version.

matthewgonzalez commented 5 months ago

We will attempt the update next week and report back.

tavindev commented 4 months ago

Experiencing this in v5.8.1

manast commented 4 months ago

Since it seems that the original authors can no longer reproduce the issue, I will close it now, so that other users who are experiencing a similar but not identical issue do not get lured into this one, as that would just confuse everybody.

@tavindev you are welcome to open a new issue with the particular details for your use case.

lukas-becker0 commented 4 months ago

@tavindev I used my test setup from last time and tested with v5.8.2, and I cannot reproduce it.