Removing active jobs - Githubissues

theoilie commented 2 years ago

We have an issue where sometimes jobs get stuck in active, most likely caused by using NodeJS cluster, and we want to remove those jobs one time on startup. We're currently doing this with Promise.allSettled to skip errors because of the big warning here that Locked jobs (in active state) can not be removed. An error will be thrown:

async deleteOldActiveJobs(queue) {
    const oldActiveJobs = await queue.getJobs(['active'])
    await Promise.allSettled(oldActiveJobs.map((job) => job.remove()))
  }

I'm wondering if there's a better way to do this or if there can be a function added to a Queue's API to clear all active jobs that are in some kind of broken state (these jobs take a few seconds but some are stuck in 'active' for days and not getting processed due to some underlying issue that started after implementing NodeJS cluster).

manast commented 2 years ago

Basically, active jobs cannot get "stuck" in active, they are either being processed by a worker or they will be moved back to waiting within 30 seconds as soon as any Worker goes up online. So instead of deleting these jobs I think it is much more beneficial trying to understand why they are in the active state so that we can solve the underlying issue so that it does not happen more times.

manast commented 2 years ago

Ah, another thing, a job cannot be locked in active unless there is a worker still working on the job, as the lock needs to be renewed every 30 seconds or it will expire automatically by Redis.

manast commented 2 years ago

Something that may help in finding issues with the workers is enabling eslint rules for Promises: https://maximorlov.com/linting-rules-for-asynchronous-code-in-javascript/ Unfortunately, these rules do not include detecting if there is a code path in any of your Promises that do never resolve (leaving a worker processing a job forever).

theoilie commented 2 years ago

Thank you for all the tips - they're very appreciated. I agree - I'd like to solve the underlying issue rather than deleting jobs. It looks like possibly a worker processing a job forever as you suggested, but I'm wondering if a job be "stuck" in active if the worker finished processing it but it ran into an error when moving it to success/failed (e.g., due to "missing lock for job"). If so, that would seem to be the underlying issue with these particular jobs.

(I opened a separate issue for the "missing lock for job" errors that started happening after enabling NodeJS clustering.)

manast commented 2 years ago

It would be interesting to debug those "stucked" jobs since if they have indeed lost the lock, they should be moved back to wait automatically. Have you checked if the jobs you got the "missing lock for job" error have actually been completed or failed?

theoilie commented 2 years ago

My best guess is that the job completed successfully but errored in the moveToFinished logic (due to "missing lock"). Then it tried to do moveToFailed but ran into the same error. I could be wrong, though - is there another interpretation for this (real) example?

03:58:10: job added 03:58:12: job started processing \: "missing lock for job" in Scripts.moveToFinished 03:58:13: log recorded for "missing lock for job" in Job.moveToFailed

Log that was recorded at 03:58:13:

Error: Missing lock for job 61701. failed
    at Scripts.finishedErrors (/usr/src/app/node_modules/bullmq/dist/cjs/classes/scripts.js:206:24)
    at Job.moveToFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/job.js:343:32)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async handleFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:320:17)
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:455:24)

Error from bull-board's error section for this job (unknown time):

Error: Missing lock for job 61701. finished
    at Scripts.finishedErrors (/usr/src/app/node_modules/bullmq/dist/cjs/classes/scripts.js:206:24)
    at Scripts.moveToFinished (/usr/src/app/node_modules/bullmq/dist/cjs/classes/scripts.js:192:24)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async handleCompleted (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:313:31)
    at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:334:20)
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:455:24)

The job's data as reported by bull-board (note the null returnvalue):

{
  "data": {
      // <removed> nothing special here
   },
  "returnValue": null
}

Code for each important line in the stack traces above (mapping of local line to line in GH project):

/usr/src/app/node_modules/bullmq/dist/cjs/classes/scripts.js:206: https://github.com/taskforcesh/bullmq/blob/d0b3412d222449c24ab36068a791d08ea19ed922/src/classes/scripts.ts#L356
/usr/src/app/node_modules/bullmq/dist/cjs/classes/job.js:343: https://github.com/taskforcesh/bullmq/blob/d0b3412d222449c24ab36068a791d08ea19ed922/src/classes/job.ts#L610
/usr/src/app/node_modules/bullmq/dist/cjs/classes/scripts.js:192: https://github.com/taskforcesh/bullmq/blob/d0b3412d222449c24ab36068a791d08ea19ed922/src/classes/scripts.ts#L336 (Notice how it was in moveToFinished first. This is what makes me think the job finished successfully but ran into the missing lock error when trying to move it from active to completed.)

manast commented 2 years ago

What Redis hosting service are you using and what is your setting for maxmemory-policy in redis.conf?

manast commented 2 years ago

Btw, moveToFinish is the same script both for completing and failing. On the other hand it can be seen in the callstack that the job processed successfully and then it called "handleCompleted". Another interesting thing is that it seems like the job just took 1 second to complete, and the lock of a job has a default of 30 seconds before expiration (unless you changed this setting), so it is very strange that the lock would just disappear like that. Thats why I wonder which setting you have on maxmemory-policy, to make sure it is not Redis that is just evicting keys in some low memory situation.

manast commented 2 years ago

I am going to introduce a couple of new error messages to help us debug this kind of situations.

manast commented 2 years ago

https://github.com/taskforcesh/bullmq/pull/1514

theoilie commented 2 years ago

What Redis hosting service are you using and what is your setting for maxmemory-policy in redis.conf?

Redis is colocated with the application as a Docker container. This setup is replicated on various server hosting companies including AWS and GCP and not seeing any differences between them. I spot checked a number of these servers to verify that the result of CONFIG GET maxmemory-policy is noeviction.

Btw, moveToFinish is the same script both for completing and failing. On the other hand it can be seen in the callstack that the job processed successfully and then it called "handleCompleted". Another interesting thing is that it seems like the job just took 1 second to complete, and the lock of a job has a default of 30 seconds before expiration (unless you changed this setting), so it is very strange that the lock would just disappear like that. Thats why I wonder which setting you have on maxmemory-policy, to make sure it is not Redis that is just evicting keys in some low memory situation.

Ah that's interesting. I wonder if there could be a race condition where one process picks up the job but another process removes the lock before the first process is done with it?

I am going to introduce a couple of new error messages to help us debug this kind of situations.

Perfect, thank you. I'll have to eventually migrate to v3 and get those changes.

manast commented 2 years ago

Ah that's interesting. I wonder if there could be a race condition where one process picks up the job but another process removes the lock before the first process is done with it?

Quite unlikely, especially in the case where the job durations are so short. If they were more than 30 seconds, then we could maybe talk about the possibility of jobs getting stalled...

Let me know how it goes. The best would be if the issue could be reproduced by us, then we could find the problem in no time.

taskforcesh / bullmq

Removing active jobs #1506