taskforcesh / bullmq

BullMQ - Message Queue and Batch processing for NodeJS and Python based on Redis
https://bullmq.io
MIT License
6.01k stars 387 forks

[Bug]: Job data not passed to worker, all queue jobs removed and excessive amount of requests sent to redis server #2763

Open samundrak opened 3 weeks ago

samundrak commented 3 weeks ago

Version

v3.10.3

Platform

NodeJS

What happened?

I have been facing this issue for a week now. The only recent changes I made were increasing the delay and adding a limiter. After these changes (I'm not sure they are the cause), I started encountering issues such as the job data not being passed to the worker: it was basically empty. To work around it temporarily, I removed a few queues, but then another problem arose: all the queue data was suddenly being removed, which is causing serious issues in production.

I couldn't pinpoint the problem, so I switched from AWS ElastiCache to a self-hosted Redis to ensure the settings were configured correctly according to the documentation. It worked well for a few days, but then the issue of the queue being automatically removed started again. I did some debugging, checked the logs using RedisInsight, and discovered that an excessive number of requests were being sent to the Redis server.

Framework: NestJS ^9.0.0

https://github.com/user-attachments/assets/e0aeb8fd-7dc9-451d-bfed-2c42eb2649d2

How to reproduce.

Not able to reproduce it in the dev or local environment.

Relevant log output

No response


manast commented 3 weeks ago

The number of requests is probably just normal. It seems like you have several issues and are conflating them, which makes it more difficult to solve them. I suggest you treat each issue as a separate thing. For example, jobs with empty data would be one thing: try to isolate the problem. Most likely this issue is in your own code; set debug logs and try to figure out whether you really are setting the data before adding the jobs. All queue data being removed also sounds like you have some code that is removing the queues, maybe some leftover test/debug code, or maybe you are not configuring the maxmemory policy of your Redis instance appropriately, although this normally would not result in all data being removed.
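One way to follow that advice is to validate the payload at the producer, before the job is ever added, so empty data is caught where the job is created rather than discovered in the worker. A minimal sketch (the `assertJobData` helper, the queue name, and the connection are illustrative assumptions, not code from this thread):

```javascript
// Hypothetical guard: refuse to enqueue an empty payload so missing job
// data fails loudly at the producer instead of surfacing in the worker.
function assertJobData(data) {
  if (data == null || Object.keys(data).length === 0) {
    throw new Error('Refusing to enqueue job with empty data');
  }
  console.debug('enqueueing job with data:', JSON.stringify(data));
  return data;
}

// Sketch of how it would be used with BullMQ's Queue.add:
// const queue = new Queue('my-queue', { connection });
// await queue.add('process', assertJobData({ userId: 42 }));
```

Because the guard is a plain function, it can be unit tested without a Redis connection.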

samundrak commented 3 weeks ago

@manast Thank you for the response.

So far, I don't have any explicit code that removes items from the queue; the only removal setting is the default one, which removes jobs after completion or failure. The empty job data issue may have been resolved after I refactored my code from the NestJS process decorator to an explicit worker class, but the queue data still gets removed frequently. We can't add any items to the queue, as they are removed immediately. I thought the excessive number of requests could also be the cause, as many DEL commands were being sent to the Redis server when I used the throttle/limiter. The number of requests looks normal in my development environment, yet even when there is no load, the number of requests sent to Redis in production is still quite high.
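For reference, the refactor described above (NestJS decorator to an explicit worker) can be sketched roughly like this; the queue name, limiter values, and function names are illustrative assumptions, not the actual code from this issue. Keeping the processor as a plain function makes it testable without Redis:

```javascript
// Plain processor function, unit-testable without a Redis connection.
// It fails loudly if a job arrives with empty data instead of silently
// processing nothing.
async function processJob(job) {
  if (!job.data || Object.keys(job.data).length === 0) {
    throw new Error(`Job ${job.id} arrived with empty data`);
  }
  return { handled: job.data };
}

// Sketch of the explicit BullMQ worker wiring (needs a live connection):
// const worker = new Worker('my-queue', processJob, {
//   connection,
//   limiter: { max: 10, duration: 1000 }, // at most 10 jobs per second
// });
```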

// Request sent to Redis when throttled (screenshot attached)

// Worker implementation (screenshot attached)

// Queue settings

    queue: {
      removeOnComplete: {
        age: 3600 * 12, // keep for up to 12 hours
        count: 1000, // keep up to 1000 jobs
      },
      removeOnFail: {
        age: 48 * 3600, // keep for up to 48 hours
      },
      delay: 5000,
      attempts: 3,
      backoff: {
        type: 'exponential',
        delay: 60000,
      },
    },

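In BullMQ these settings would normally be passed as `defaultJobOptions` when constructing the queue. A sketch with the retention windows spelled out (the queue name and connection are assumptions; the values mirror the settings above, and `age` is in seconds):

```javascript
// Same settings as above, expressed as BullMQ defaultJobOptions.
// Note: age is in seconds, so 3600 * 12 retains completed jobs for 12 hours.
const defaultJobOptions = {
  removeOnComplete: {
    age: 3600 * 12, // 43200 s = 12 hours
    count: 1000,    // keep at most 1000 completed jobs
  },
  removeOnFail: {
    age: 48 * 3600, // 172800 s = 48 hours
  },
  delay: 5000,      // hold each job for 5 s before it becomes processable
  attempts: 3,
  backoff: { type: 'exponential', delay: 60000 },
};

// const queue = new Queue('my-queue', { connection, defaultJobOptions });
```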
manast commented 3 weeks ago

I am quite confident the issue with missing data is not a bug in BullMQ.

roggervalf commented 3 weeks ago

Hi @samundrak, you are using quite an old version. Could you please try the latest one and let us know?

samundrak commented 3 weeks ago

Thank you for the response

@manast I was thinking the same for most of the time, but I couldn't find any place in the implementation that could cause all the queue data to be removed. One thing I did recently was remove the throttle from the worker settings, and it hasn't occurred since then, but I am still giving it some time to confirm.

@roggervalf I am not sure the old version is the issue, but I think I should work on updating it now.

I will update you on the status once I update the versions. Thank you for the help