Closed: nlochschmidt closed this issue 1 year ago.
Filed in the wrong repository 🙈. Please transfer it over to https://github.com/taskforcesh/bullmq-pro-support 🙏
This seems to be a typical case where jobs are getting stalled due to CPU exhaustion, either because there are too many jobs running on the same machine or because the jobs are keeping the event loop busy for too long (more than the default 30 seconds). Another possibility is an incorrect Redis configuration where the max memory policy is not set to "noeviction". Here are some links with additional information:
https://docs.bullmq.io/guide/workers/sandboxed-processors
https://docs.bullmq.io/guide/going-to-production#max-memory-policy
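For reference, a sandboxed processor is set up roughly like this (a minimal sketch; the queue name, file path and connection settings are placeholders, not taken from this issue):

```ts
import { Worker } from 'bullmq';
import * as path from 'path';

// Sandboxed processor: the job handler lives in its own file and runs in a
// separate process, so a CPU-heavy job cannot block the worker's main event
// loop and trigger the stalled-job check.
const worker = new Worker(
  'my-queue',                            // placeholder queue name
  path.join(__dirname, 'processor.js'),  // path to the processor file
  { connection: { host: 'localhost', port: 6379 } },
);
```

where processor.js simply exports the async job handler as its module export.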
max-memory-policy is set to noeviction and we followed the going-to-production guide. Haven't used sandboxed workers yet, so that might be an option to try.
What is puzzling me though is that what I posted above are the entire hashes at those keys. There is nothing else there: no name, no data, no opts, ... How can it happen that we get a hash at a key that only contains failedReason and stacktrace? Looking at the code, which relies on the atomicity of MULTI/EXEC and Lua scripts, I don't see how this can happen because of a stalled worker or a busy event loop 🤔
What do you mean by the sentence below? Can you elaborate a bit more?
"How can it happen that we get a hash at a key that only contains failedReason and stacktrace?"
Ok, if what you mean is that you have examined the job key and there are only two fields, failedReason and stacktrace, the explanation would be the following: the job had already completed and was removed automatically, but some worker that had been working on the job before got stalled, eventually finished processing, and when it tried to move the job to completed it failed with the errors above (since the job had already been completed by a different worker). Because the job had been removed previously, all that remains are the two last fields written to the job hash.
What kind of processing are you doing? Is it very CPU intensive?
But saveStacktrace does an EXISTS check on the job id key, so if the job was already processed and removed by another worker it shouldn't write those fields, or should it?
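Just to illustrate what I mean by that check, here is a purely hypothetical sketch in plain application code; in BullMQ the check happens atomically inside a Lua script, not in two round trips like this:

```ts
import Redis from 'ioredis';

// Hypothetical illustration of the guard being discussed: only write
// failedReason/stacktrace if the job hash still exists.
async function saveStacktraceGuarded(
  redis: Redis,
  jobKey: string,
  failedReason: string,
  stacktrace: string,
): Promise<void> {
  const exists = await redis.exists(jobKey);
  if (!exists) {
    // Job was already completed and removed by another worker: write nothing.
    return;
  }
  await redis.hset(jobKey, 'failedReason', failedReason, 'stacktrace', stacktrace);
}
```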
What kind of processing are you doing? Is it very CPU intensive?
Not really. The workers are calling external APIs and are mostly waiting for their response and then storing things in the database, again waiting mostly on the database. CPU utilization is around 0.25-0.4 cores. Far away from the 2 cores we reserved for each of them.
But saveStacktrace does an EXISTS check on the job id key, so if the job was already processed and removed by another worker it shouldn't write those fields, or should it?
Ah ok, then I don't know; that was a theory of what I thought was happening. If you are not running CPU-expensive jobs then you should not get stalled jobs unless the worker dies unexpectedly.
Hmm so do you think this might be a bug?
In my experience we can never fully prevent services from dying unexpectedly, or the network having issues for that matter.
Unless we are using BullMQ wrong, my expectation would be that even with stalled jobs and workers dying, removeOnFail: true and removeOnComplete: true should eventually cause jobs to get reliably removed.
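Roughly what I mean, as a hypothetical sketch (not our actual production configuration, and using the open-source Queue class here just for illustration):

```ts
import { Queue } from 'bullmq';

// With these defaults, every job should be deleted from Redis once it has
// completed or failed, even if a worker stalled on it at some point.
const queue = new Queue('poller', {
  connection: { host: 'localhost', port: 6379 },
  defaultJobOptions: {
    removeOnComplete: true,
    removeOnFail: true,
  },
});
```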
If we can't rely on jobs to be removed, we would need to think about some kind of cron job that looks through all jobs periodically and deletes the ones that look suspicious. However, I would rather not interfere with the inner workings of BullMQ like this.
In my experience we can never fully prevent services from dying unexpectedly, or the network having issues for that matter.
That's why we have the stalled check mechanism to begin with, to handle unexpected cases. I do not know if it is a bug or not, but it seems like there is something special going on in your setup, since this behavior has not been reported before. I cannot rule out that there is a bug in BullMQ that in some edge cases would end up with an empty job consisting only of a failedReason field, but even in that case it would be pretty uncommon; it is quite suspicious that this happens so often in your setup that it is creating a problem.
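For context, the stalled check is driven by a few worker options. A sketch showing the documented defaults (placeholder queue name and connection, not your configuration):

```ts
import { Worker } from 'bullmq';

// Placeholder worker illustrating the stalled-check knobs and their defaults.
const worker = new Worker('my-queue', async (job) => { /* ... */ }, {
  connection: { host: 'localhost', port: 6379 },
  lockDuration: 30000,    // ms a job lock is valid; if not renewed, the job can be considered stalled
  stalledInterval: 30000, // ms between checks for stalled jobs
  maxStalledCount: 1,     // how many times a job may stall before it is failed for good
});
```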
Hmm, could be something in our setup, of course.
That said, I just had another look at https://github.com/taskforcesh/bullmq/issues/1662 and in the screenshot of the stuck job there are also only the two fields.
I think the issue we are experiencing might actually be a duplicate of that one after all?
Hey @nlochschmidt, I think I found the root cause of this issue and opened a PR for it. We were using the old saveStacktrace function instead of the new saveStacktrace function in the Pro version, so with my PR it should now use the expected script.
Thanks @roggervalf.
I guess this will also fix the second type of stuck jobs that says "Job xyz is not in the active state. finished", right?
Looking forward to the next release 🙏
Hi @nlochschmidt, yeah, the last fix should also handle this case, as covered by the test case in https://github.com/taskforcesh/bullmq/pull/1918: removed jobs should not save a stacktrace.
@nlochschmidt the fix was already released yesterday; just upgrade to version v5.3.2.
Alright. I didn't see it in the changelog (https://docs.bullmq.io/bullmq-pro/changelog), that's why I asked.
Background: Our use-case is that we have implemented a polling mechanism and we want to rely on jobIds to make sure we don't poll the same thing concurrently. We have been using Bull so far and are looking to switch over to BullMQ Pro after this discussion: https://github.com/OptimalBits/bull/issues/2594. We have now transitioned 1% of our workload to BullMQ Pro and are unfortunately seeing a similar issue to the one linked 🙈. What we experience is a slow build-up of stuck jobs in production. The jobs can still be found in Redis under their job key but are in none of the queues.
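For illustration, the pattern looks roughly like this (hypothetical names and options, not our actual code):

```ts
import { Queue } from 'bullmq';

const queue = new Queue('poller', { connection: { host: 'localhost', port: 6379 } });

// Using a deterministic jobId: while a job with this id still exists in the
// queue, adding it again is a no-op, so the same resource is never polled
// concurrently.
async function schedulePoll(resourceId: string): Promise<void> {
  await queue.add(
    'poll',
    { resourceId },
    { jobId: `poll-${resourceId}`, removeOnComplete: true, removeOnFail: true },
  );
}
```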
There are two types of failed jobs:
Type 1:
Type 2:
I haven't been able to reproduce this behavior with a minimal test case just yet, but I was hoping that maybe someone can see what's going on based on the provided information.
Our queue configuration is:
Our worker configuration is:
and we schedule jobs with these options:
BullMQ Pro version 5.2.0
Help would be very much appreciated 🙏
P.S: This might or might not be a duplicate of https://github.com/taskforcesh/bullmq/issues/1662 however we are using BullMQ Pro and the Groups feature, so I decided to open a new report here instead.