tobiaslins opened this issue 3 years ago
@manast Do you have any ideas or suggestions for us? We keep having short processing outages and have to restart our services manually to get them processing again. If we can't get this resolved we will need to move to another queue, and I really want to stick with Bull :(
Thanks a ton!
The connection with Redis is handled by ioredis, which has reconnection functionality enabled by default: https://github.com/luin/ioredis. I am not sure what your problem is specifically. According to the logs you posted, the problem seems to be that the domain private-db-redis-XXXXXXXXX.db.ondigitalocean.com is not resolved correctly, which does not look like a problem with BullMQ as far as I can see. It seems mostly like a connectivity problem inside your infrastructure, unless you can provide more details that prove otherwise.
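For reference, a minimal sketch of passing an explicit ioredis connection to a BullMQ worker so the reconnection behaviour is observable; the host, queue name and retry values below are illustrative assumptions, not taken from this thread:

```js
const { Worker } = require('bullmq');
const IORedis = require('ioredis');

// ioredis reconnects on its own by default; retryStrategy only tunes the delay between attempts.
const connection = new IORedis({
  host: 'my-redis-host', // hypothetical host
  port: 6379,
  retryStrategy: (times) => Math.min(times * 200, 2000), // back off up to 2s
  maxRetriesPerRequest: null, // let BullMQ's blocking commands wait through reconnects
});

connection.on('error', (err) => console.error('Redis connection error', err));
connection.on('reconnecting', () => console.log('Redis reconnecting...'));

const worker = new Worker(
  'my-queue', // hypothetical queue name
  async (job) => {
    // ...process the job...
  },
  { connection },
);
```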
@manast Thanks for your answer. The strange thing is that first this happens: `Error: Missing Job 17568662 when trying to move from active to delayed`. After that, the worker crashes and the whole service restarts and can't connect anymore.
How can we make sure the worker keeps processing even when this error happens?
Thanks!
What kind of Redis instance is this? Clustered, or replicated with Sentinel?
Btw, the error `Error: Missing Job 17568662 when trying to move from active to delayed` is produced when a job has failed and is scheduled for a retry with backoff, but in theory this error should never be able to occur. Maybe something goes haywire with Redis and this is the symptom; then you get the connection error, but maybe the connection error actually happened before.
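For context, that code path is only reached for jobs that were added with retry and backoff options, e.g. something like the sketch below (queue name, payload and options are illustrative):

```js
const { Queue } = require('bullmq');

const queue = new Queue('my-queue', {
  connection: { host: 'my-redis-host', port: 6379 }, // hypothetical connection
});

async function enqueue() {
  // When an attempt of a job like this fails, BullMQ moves it from "active"
  // to "delayed" until the backoff expires -- the transition the error above refers to.
  await queue.add(
    'my-job',
    { some: 'payload' },
    { attempts: 5, backoff: { type: 'exponential', delay: 1000 } },
  );
}
```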
Managed Redis on DigitalOcean, not replicated at the moment. It ran flawlessly for months, and this has only been occurring for the last 1-2 weeks. Would adding a cluster solve this? Thanks for the insights so far!
Is there a way to check whether the worker is still processing jobs, or even better, an event emitter for this?
I want to gracefully shut down the service when this happens.
A cluster will not help. The simpler the setup, the easier it is to find the root of the issue. The worker will process jobs as long as it is connected, which does not seem to be the case judging from the error you posted above.
If I were you, I would log in to the worker's instance when this happens and check whether it has connectivity.
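On the event-emitter question above: a BullMQ Worker is itself an event emitter, so job activity can be observed directly. A minimal sketch using the standard `completed`/`failed`/`error` worker events; the watchdog threshold and queue details are illustrative:

```js
const { Worker } = require('bullmq');

// Hypothetical queue name and connection, for illustration only.
const worker = new Worker(
  'my-queue',
  async (job) => { /* ...process the job... */ },
  { connection: { host: 'my-redis-host', port: 6379 } },
);

let lastActivity = Date.now();

worker.on('completed', () => { lastActivity = Date.now(); });
worker.on('failed', () => { lastActivity = Date.now(); });
worker.on('error', (err) => console.error('Worker error', err));

// Illustrative watchdog: warn (or trigger a graceful shutdown) if no job
// has completed or failed for 5 minutes.
setInterval(() => {
  if (Date.now() - lastActivity > 5 * 60 * 1000) {
    console.warn('No job activity for 5 minutes, worker may be stalled');
  }
}, 60 * 1000);
```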
Thanks. The strange thing is that when I restart the Docker container that connects to Redis, it just starts processing again.
I also have a health check that does `await requestQueue.getWaitingCount()`, and that endpoint shows the jobs queuing up. The worker just isn't processing anymore!
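For reference, a minimal sketch of such a health-check endpoint, assuming an Express server; the route, port, queue name and connection are illustrative:

```js
const express = require('express');
const { Queue } = require('bullmq');

const app = express();
const requestQueue = new Queue('my-queue', {
  connection: { host: 'my-redis-host', port: 6379 }, // hypothetical connection
});

// Reports how many jobs are waiting; a steadily growing count while the
// worker still appears connected is exactly the symptom described here.
app.get('/health', async (req, res) => {
  try {
    const waiting = await requestQueue.getWaitingCount();
    res.json({ waiting });
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});

app.listen(3000);
```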
OK, so the problem is not reconnection, as the issue title says; it is that despite being connected, the worker does not process anymore, right?
True! Good point.
I cannot find anything wrong in BullMQ, but I wonder: are you attaching an error handler to your worker? Like:
myworker.on('error', (err) => console.error(err));
Because of the way Node.js works, if such a listener is missing, the process will exit when an error is emitted... (yes, I also think this is awkward behaviour).
Yes, I've attached it like this:
worker.on("error", (error) => console.log("Worker Error", error));
Thank you for all the help.
I'll further watch this issue and update in case I find something!
@tobiaslins did you attach it just now, or was it also attached before, when you got the error?
@manast I already had it attached before and it still crashed!
UPDATE: It was my fault. I had a typo in the name of the QueueScheduler; the names for the Worker, Queue and QueueScheduler must match.
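To illustrate the fix: the same queue name needs to be passed to the Queue, QueueScheduler and Worker constructors, otherwise the scheduler is watching a different (effectively empty) queue. Names and connection below are illustrative:

```js
const { Queue, QueueScheduler, Worker } = require('bullmq');

const connection = { host: 'my-redis-host', port: 6379 }; // hypothetical connection
const QUEUE_NAME = 'requests';                            // one shared name

const queue = new Queue(QUEUE_NAME, { connection });
const queueScheduler = new QueueScheduler(QUEUE_NAME, { connection });
const worker = new Worker(QUEUE_NAME, async (job) => {
  // ...process the job...
}, { connection });

// A typo such as new QueueScheduler('request', ...) would leave delayed and
// stalled jobs of 'requests' unattended, without any obvious error.
```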
I am testing the restart of jobs by manually killing the process (CTRL+C), because my server will restart the job every day. I do a proper close of the Queue, QueueScheduler, Worker and Redis connection. When I restart the process, I can see the job in the active state, but it does not seem to be re-processed.
Is there something else I have to do to get the jobs restarted?
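For reference, a sketch of the kind of shutdown sequence being described, reusing the instances from the sketch above; the signal handling is illustrative:

```js
// Stop taking new work first, then tear down the rest.
async function shutdown() {
  await worker.close();         // graceful close: stop picking up new jobs
  await queueScheduler.close();
  await queue.close();
  // If you created your own IORedis instance, also call connection.quit() here.
  process.exit(0);
}

process.on('SIGINT', shutdown);  // CTRL+C
process.on('SIGTERM', shutdown); // container stop
```

Note that a job killed mid-processing generally stays in the active state until a running QueueScheduler detects it as stalled and moves it back to waiting, so make sure a QueueScheduler is running after the restart.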
I am getting the following error lately:
After the container restarts, it can't connect to Redis anymore. Bull doesn't even recognize that it can't reconnect, so the process stays alive.
What is the best way to handle this?
My health check calls `queue.getWaitingCount()` and this seems to work fine after the restart, but the worker won't start processing again.
Thanks!
Version: bullmq@1.15.1
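One way to at least make that silent state visible is to create the ioredis connection yourself, log its lifecycle events, and fail fast when reconnection keeps failing so the container restarts cleanly. A rough sketch under those assumptions (host, thresholds and the exit-on-give-up behaviour are illustrative, and BullMQ may duplicate the connection internally, so treat this as a visibility aid rather than a complete fix):

```js
const IORedis = require('ioredis');
const { Worker } = require('bullmq');

const connection = new IORedis({
  host: 'my-redis-host', // hypothetical host
  port: 6379,
  maxRetriesPerRequest: null,
  retryStrategy: (times) => {
    if (times > 20) {
      // Give up after ~20 attempts so the process exits and the container
      // can be restarted, instead of hanging around half-connected.
      console.error('Could not reconnect to Redis, exiting');
      process.exit(1);
    }
    return Math.min(times * 500, 5000); // back off up to 5s between attempts
  },
});

connection.on('error', (err) => console.error('Redis error', err));
connection.on('close', () => console.warn('Redis connection closed'));
connection.on('reconnecting', () => console.log('Reconnecting to Redis...'));

const worker = new Worker('my-queue', async (job) => {
  // ...process the job...
}, { connection });
```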