taskforcesh / bullmq

BullMQ - Message Queue and Batch processing for NodeJS and Python based on Redis
https://bullmq.io
MIT License

Worker doesn't start processing after reconnect #452

Open tobiaslins opened 3 years ago

tobiaslins commented 3 years ago

I am getting the following error lately:

You have triggered an unhandledRejection, you may have forgotten to catch a Promise rejection:
Error: Missing Job 17568662 when trying to move from active to delayed
     at Function.moveToDelayed (/usr/app/analytics-server/node_modules/bullmq/dist/classes/scripts.js:178:23)
     at runMicrotasks (<anonymous>)
     at processTicksAndRejections (internal/process/task_queues.js:97:5)
 exited with code [1] via signal [SIGINT]
 starting in -cluster mode-
 Listening for monitoring :4443
 Error: getaddrinfo ENOTFOUND private-db-redis-XXXXXXXXX.db.ondigitalocean.com
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:66:26) {
   errno: 'ENOTFOUND',
   code: 'ENOTFOUND',
   syscall: 'getaddrinfo',
   hostname: 'ENOTFOUND private-db-redis-XXXXXXXXX.db.ondigitalocean.com'
 }

After the container restarts it can't connect to Redis anymore. Bull doesn't even recognize that it can't reconnect, so the process stays alive.

What is the best way to handle this?

  1. How can I make sure that there is no unhandledRejection in that case?
  2. Is there a way to reconnect a worker?

My healthcheck calls queue.getWaitingCount(), and it seems to work fine after the restart, but the worker won't start processing again.

Thanks!

Version: bullmq@1.15.1
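
To make question 1 concrete, this is roughly the kind of handling I have in mind (just a sketch; worker stands for my Worker instance):

process.on('unhandledRejection', (reason) => {
  // Log the rejection instead of letting the whole process die on it
  console.error('Unhandled rejection:', reason);
});

worker.on('error', (err) => {
  // Surface worker-level errors instead of relying on unhandledRejection
  console.error('Worker error:', err);
});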

tobiaslins commented 3 years ago

@manast Do you have any ideas/suggestions for us? We keep running into short processing outages and have to restart our services manually to get processing going again. If we can't get this resolved we'll need to move to another queue, and I really want to stick with Bull :(

Thanks a ton!

manast commented 3 years ago

Connection with Redis is handled by ioredis, which has reconnection functionality enabled by default: https://github.com/luin/ioredis. I am not sure what your problem is specifically. According to the logs you posted, the domain private-db-redis-XXXXXXXXX.db.ondigitalocean.com is not being resolved correctly, which does not look like a problem with BullMQ as far as I can see... it seems mostly like a connectivity problem inside your infrastructure, unless you can provide more details that prove otherwise.
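
For reference, BullMQ passes the connection options straight through to ioredis, so reconnection behaviour can be tuned there. A minimal sketch (the queue name, host, port and processor function are placeholders, and the retryStrategy is only an example):

const { Worker } = require('bullmq');

const worker = new Worker('analytics', processor, {
  connection: {
    host: 'private-db-redis-XXXXXXXXX.db.ondigitalocean.com', // placeholder
    port: 6379,                                               // placeholder
    // ioredis retries forever by default; a custom retryStrategy returns the
    // delay in ms before the next attempt, or null to stop retrying.
    retryStrategy: (times) => Math.min(times * 500, 5000),
  },
});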

tobiaslins commented 3 years ago

@manast Thanks for your answer. The strange thing is that this happens first: Error: Missing Job 17568662 when trying to move from active to delayed. After that, the worker crashes and the whole service restarts and can't connect anymore. How can we make sure the worker keeps processing even if this error happens?

Thanks!

manast commented 3 years ago

what kind of redis instance is this? clustered, replicated with sentinel?

manast commented 3 years ago

Btw, the error Error: Missing Job 17568662 when trying to move from active to delayed is produced when a job has failed and is scheduled for a retry with backoff, but in theory this error should never be able to occur. Maybe something goes haywire with Redis and this is just the symptom; then you get the connection error, although the connection error may actually have happened first.
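
For context, that code path is only reached for jobs that were added with retry options, something like this (the job name and numbers are just an example):

await queue.add('track-event', payload, {
  attempts: 5,                                   // retry the job up to 5 times
  backoff: { type: 'exponential', delay: 1000 }, // wait with backoff between retries
});

When such a job fails, the worker moves it from active to delayed until the backoff expires; that is the move that is failing in your log.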

tobiaslins commented 3 years ago

It is a managed Redis on DigitalOcean, not replicated at the moment. It was running flawlessly for months, and this has only been occurring for about 1-2 weeks. Would adding a cluster solve this? Thanks for the insights so far!

tobiaslins commented 3 years ago

Is there a way to check whether the worker is still processing jobs, or even better, an event that is emitted when it stops?

I want to gracefully shut down the service when this happens.
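
Something like this is what I have in mind, a rough sketch only (the five-minute threshold is arbitrary and it assumes jobs arrive continuously):

let lastActive = Date.now();

// 'active' fires whenever the worker picks up a job
worker.on('active', () => { lastActive = Date.now(); });
worker.on('completed', () => { lastActive = Date.now(); });

setInterval(async () => {
  if (Date.now() - lastActive > 5 * 60 * 1000) {
    console.error('Worker has been idle for too long, shutting down');
    await worker.close();  // graceful shutdown
    process.exit(1);       // let the orchestrator restart the container
  }
}, 60 * 1000);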

manast commented 3 years ago

A cluster will not help. The simpler the setup, the easier it is to find the root of the issue. The worker will process jobs as long as it is connected, which does not seem to be the case from the error you posted above.

manast commented 3 years ago

If I were you I would log in to the worker's instance when this happens and check whether it has connectivity.

tobiaslins commented 3 years ago

Thanks. The strange thing is that when I restart the Docker container that connects to Redis, it just starts processing again. I also have a healthcheck that does await requestQueue.getWaitingCount(), and that endpoint shows me the jobs queuing up, but the worker is not processing anymore!
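
For reference, the healthcheck looks roughly like this (Express is just what I happen to use here; requestQueue is my Queue instance):

app.get('/health', async (req, res) => {
  // This only proves the Queue connection works; it says nothing about the Worker
  const waiting = await requestQueue.getWaitingCount();
  res.json({ waiting });
});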

manast commented 3 years ago

OK, so the problem is not reconnection as the issue title suggests; it is that, despite being connected, the worker does not process anymore, right?

tobiaslins commented 3 years ago

True! Good point.

manast commented 3 years ago

I cannot find anything wrong in BullMQ, but I wonder if you are attaching an error handler to your worker, like:

myworker.on('error', (err) => console.error(err));

Because of the way NodeJS works, if you are lacking such a listener, the process will exit with an error... (yes, I also think this is awkward behaviour).

manast commented 3 years ago

https://nodejs.org/api/events.html#events_error_events

tobiaslins commented 3 years ago

Yes, I've attached it like this:

worker.on("error", (error) => console.log("Worker Error", error));

Thank you for all the help.
I'll keep watching this issue and update it in case I find something!

manast commented 3 years ago

@tobiaslins did you attach it just now, or was it also attached before, when you got the error?

tobiaslins commented 3 years ago

@manast I already had it attached before and it still crashed!

eltoroit commented 3 years ago

UPDATE: It was my fault. I had a typo in the name of the QueueScheduler; the names for Worker, Queue and QueueScheduler must match.

I am testing the restart of jobs by manually killing the process (CTRL+C), because my server will restart the job every day, and I am doing a proper close of the Queue, QueueScheduler, Worker and Redis connection. When I restart the process, I can query the job in the active state, but it does not seem to be re-processed.

Is there something else I have to do to get the jobs restarted?
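
For anyone hitting the same thing as in my UPDATE above, the setup that works for me now looks roughly like this (queue name, connection details and processor are placeholders):

const { Queue, QueueScheduler, Worker } = require('bullmq');

const queueName = 'daily-job';                        // must be identical in all three
const connection = { host: '127.0.0.1', port: 6379 };

const queue = new Queue(queueName, { connection });
// In BullMQ 1.x the QueueScheduler is needed for delayed/retried jobs to be moved back to wait
const scheduler = new QueueScheduler(queueName, { connection });
const worker = new Worker(queueName, async (job) => { /* process job */ }, { connection });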