taskforcesh / bullmq

BullMQ - Message Queue and Batch processing for NodeJS and Python based on Redis
https://bullmq.io
MIT License

Workers stop processing jobs after Redis reconnect #648

Closed · akramali86 closed this issue 3 years ago

akramali86 commented 3 years ago

In production we're using Amazon ElastiCache with BullMQ ^1.34.2.

We're finding that in the event of a failover, the workers emit the following error: "UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)". After that, the workers stop processing jobs, although jobs can still be queued.

Currently we have to redeploy our app to rectify this. Is there anything we can do to handle this error so that the workers start processing jobs again once Redis reconnects? Thanks.
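
For illustration (the queue name, processor, and connection options below are placeholders, not our actual setup), attaching an error listener on the worker at least surfaces the UNBLOCKED error in the logs:

const { Worker } = require('bullmq');

const worker = new Worker('my-queue', async job => {
    // ... process the job ...
}, { connection: { host: 'my-elasticache-endpoint', port: 6379 } });

// BullMQ workers emit 'error' for Redis-level failures, so the
// ElastiCache failover shows up here even after processing stops.
worker.on('error', err => {
    console.error('Worker error:', err.message);
});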

manast commented 3 years ago

Yeah, I think I know why this happens. There is a loop inside BullMQ that throws an exception in this case and stops looping. There is a fix in the older Bull that I can port to BullMQ, which should resolve the issue.

akramali86 commented 3 years ago

Thanks @manast. In the meantime, would you see any issue with us manually calling the run method on an interval to restart the worker? I know it's not very elegant, but it seems to work. Just wondering if it would cause any memory leaks, etc.

Example:

const { Worker } = require('bullmq');

// The queue name and processor here stand in for the real ones.
const worker = new Worker('worker', async job => {
    // ... process the job ...
});

// Every minute, check whether the worker's processing loop has stopped
// (e.g. after an ElastiCache failover) and restart it if so.
setInterval(() => {
    if (!worker.running && !worker.closing) {
        console.log('Restarting worker');
        worker.run().catch(() => {
            // run() rejected, so reset the internal flag and let the
            // next interval try again.
            worker.running = false;
            console.log('Could not restart worker');
        });
    }
}, 60000);

manast commented 3 years ago

I do not see any issue at first glance; it should work.

sven-codeculture commented 3 years ago

We were having the same issue a few days ago (the worker just died without any notification), but since the isRunning() method was introduced on the worker, everything seems to work fine for us when we simply restart the workers. (In our case we do it in Kubernetes: the health check fails and the worker pod is restarted if the worker dies; see the sketch below.)
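
A rough sketch of that approach (the HTTP server, port, and queue name here are illustrative, not our exact setup): a liveness endpoint returns 500 once isRunning() reports false, so Kubernetes restarts the pod.

const http = require('http');
const { Worker } = require('bullmq');

const worker = new Worker('my-queue', async job => {
    // ... process the job ...
});

// Kubernetes liveness probe: report unhealthy once the worker's
// processing loop has stopped, so the pod gets restarted.
http.createServer((req, res) => {
    const healthy = worker.isRunning();
    res.writeHead(healthy ? 200 : 500, { 'Content-Type': 'text/plain' });
    res.end(healthy ? 'ok' : 'worker not running');
}).listen(3000);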

manast commented 3 years ago

@sven-codeculture what do you mean by "it died without any notification"?

github-actions[bot] commented 3 years ago

:tada: This issue has been resolved in version 1.40.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket: