Closed: nlochschmidt closed this issue 1 year ago.
Filed in the wrong repository 🙈. Please transfer it over to https://github.com/taskforcesh/bullmq-pro-support 🙏
This seems to be a typical case where jobs are getting stalled due to CPU exhaustion, either because there are too many jobs running on the same machine or because the jobs are keeping the event loop busy for too long (more than the default 30 seconds). Another possibility is an incorrect Redis configuration where the max memory policy is not set to "noeviction". Here are some links with additional information:
https://docs.bullmq.io/guide/workers/sandboxed-processors
https://docs.bullmq.io/guide/going-to-production#max-memory-policy
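For reference, a sandboxed processor is set up roughly like this (a minimal sketch; the queue name, file path and connection settings are placeholders, not taken from this issue):

```ts
import { Worker } from 'bullmq';
import * as path from 'path';

// Sandboxed processor: the job handler lives in its own file and runs in a
// separate process, so a CPU-heavy job cannot block the worker's main event
// loop and trigger the stalled-job check.
const worker = new Worker(
  'my-queue',                            // placeholder queue name
  path.join(__dirname, 'processor.js'),  // path to the processor file
  { connection: { host: 'localhost', port: 6379 } },
);
```

where processor.js simply exports the async job handler as its module export.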
max-memory-policy is set to noeviction and we followed the going-to-production guide. Haven't used sandboxed workers yet, so that might be an option to try.
What is puzzling me though is that what I posted above are the entire hashes at those keys. There is nothing else there: no name, no data, no opts, ... How can it happen that we get a hash at a key that only contains failedReason and stacktrace? Looking at the code, which relies on the atomicity of MULTI/EXEC and Lua scripts, I don't see how this can happen because of a stalled worker or a busy event loop 🤔
What do you mean by the sentence below? Can you elaborate a bit more?
"How can it happen that we get a hash at a key that only contains failedReason and stacktrace?"
Ok, if what you mean is that you have examined the job key and there are only two fields, failedReason and stacktrace, the explanation would be the following: the job had already completed and was removed automatically, but some worker that had been working on the job before got stalled, eventually finished processing, and when it tried to move the job to completed it failed with the errors above (since the job had already been completed by a different worker). Because the job had been removed previously, all that remains are the two last fields written to the job hash.
What kind of processing are you doing? Is it very CPU intensive?
But saveStacktrace does an EXISTS check on the job id key, so if the job was already processed and removed by another worker it shouldn't write those fields, or should it?
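Just to illustrate what I mean by that check, here is a purely hypothetical sketch in plain application code; in BullMQ the check happens atomically inside a Lua script, not in two round trips like this:

```ts
import Redis from 'ioredis';

// Hypothetical illustration of the guard being discussed: only write
// failedReason/stacktrace if the job hash still exists.
async function saveStacktraceGuarded(
  redis: Redis,
  jobKey: string,
  failedReason: string,
  stacktrace: string,
): Promise<void> {
  const exists = await redis.exists(jobKey);
  if (!exists) {
    // Job was already completed and removed by another worker: write nothing.
    return;
  }
  await redis.hset(jobKey, 'failedReason', failedReason, 'stacktrace', stacktrace);
}
```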
What kind of processing are you doing? Is it very CPU intensive?
Not really. The workers are calling external APIs and are mostly waiting for their response and then storing things in the database, again waiting mostly on the database. CPU utilization is around 0.25-0.4 cores. Far away from the 2 cores we reserved for each of them.
But saveStacktrace does an EXISTS check on the job id key, so if the job was already processed and removed by another worker it shouldn't write those fields, or should it?
Ah ok, then I don't know; that was a theory of what I thought was happening. If you are not running CPU-expensive jobs then you should not get stalled jobs unless the worker dies unexpectedly.
Hmm so do you think this might be a bug?
In my experience we can never fully prevent services from dying unexpectedly, or the network having issues for that matter.
Unless we are using BullMQ wrong, my expectation would be that even with stalled jobs and workers dying, removeOnFail: true and removeOnComplete: true should eventually cause jobs to get reliably removed.
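Roughly what I mean, as a hypothetical sketch (not our actual production configuration, and using the open-source Queue class here just for illustration):

```ts
import { Queue } from 'bullmq';

// With these defaults, every job should be deleted from Redis once it has
// completed or failed, even if a worker stalled on it at some point.
const queue = new Queue('poller', {
  connection: { host: 'localhost', port: 6379 },
  defaultJobOptions: {
    removeOnComplete: true,
    removeOnFail: true,
  },
});
```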
If we can't rely on jobs to be removed, we would need to think about some kind of cron job that looks through all jobs periodically and deletes the ones that look suspicious. However, I would rather not interfere with the inner workings of BullMQ like this.
In my experience we can never fully prevent services from dying unexpectedly, or the network having issues for that matter.
That's why we have the stalled check mechanism to begin with, to handle unexpected cases. I do not know if it is a bug or not, but it seems like there is something special going on in your setup, since this behavior has not been reported before. I cannot rule out that there is a bug in BullMQ that in some edge cases would end up with an empty job consisting only of a failedReason field, but even in that case it would be pretty uncommon; it is quite suspicious that this happens so often in your setup that it is creating a problem.
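For context, the stalled check is driven by a few worker options. A sketch showing the documented defaults (placeholder queue name and connection, not your configuration):

```ts
import { Worker } from 'bullmq';

// Placeholder worker illustrating the stalled-check knobs and their defaults.
const worker = new Worker('my-queue', async (job) => { /* ... */ }, {
  connection: { host: 'localhost', port: 6379 },
  lockDuration: 30000,    // ms a job lock is valid; if not renewed, the job can be considered stalled
  stalledInterval: 30000, // ms between checks for stalled jobs
  maxStalledCount: 1,     // how many times a job may stall before it is failed for good
});
```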
Hmm, could be something in our setup, of course.
That said, I just had another look at https://github.com/taskforcesh/bullmq/issues/1662 and in the screenshot of the stuck job there are also only the two fields.
I think the issue we are experiencing might actually be a duplicate of that one after all?
Hey @nlochschmidt, I think I found the root cause of this issue and opened a PR for it. We were using the old saveStacktrace function instead of the new saveStacktrace function in the Pro version, so with my PR it should now use the expected script.
Thanks @roggervalf.
I guess this will also fix the second type of stuck jobs that says "Job xyz is not in the active state. finished", right?
Looking forward to the next release 🙏
Hi @nlochschmidt, yeah, the last fix should also handle this case, as covered by the test case in https://github.com/taskforcesh/bullmq/pull/1918: removed jobs should not save a stacktrace.
@nlochschmidt the fix was already released yesterday; just upgrade to version v5.3.2.
Alright. I didn't see it in the changelog (https://docs.bullmq.io/bullmq-pro/changelog), that's why I asked.
Background: Our use-case is that we have implemented a polling mechanism and we want to rely on jobIds to make sure we don't poll the same thing concurrently. We have been using Bull so far and are looking to switch over to BullMQ Pro after this discussion: https://github.com/OptimalBits/bull/issues/2594. We have now transitioned 1% of our workload to BullMQ Pro and are unfortunately seeing a similar issue to the one linked 🙈. What we experience is a slow build-up of stuck jobs in production. The jobs can still be found in Redis under their job key but are in none of the queues.
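For illustration, the pattern looks roughly like this (hypothetical names and options, not our actual code):

```ts
import { Queue } from 'bullmq';

const queue = new Queue('poller', { connection: { host: 'localhost', port: 6379 } });

// Using a deterministic jobId: while a job with this id still exists in the
// queue, adding it again is a no-op, so the same resource is never polled
// concurrently.
async function schedulePoll(resourceId: string): Promise<void> {
  await queue.add(
    'poll',
    { resourceId },
    { jobId: `poll-${resourceId}`, removeOnComplete: true, removeOnFail: true },
  );
}
```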
There are two types of failed jobs:
Type 1:
Type 2:
I haven't been able to reproduce this behavior with a minimal test case just yet, but I was hoping that maybe someone can see what's going on based on the provided information.
Our queue configuration is:
Our worker configuration is:
and we schedule jobs with these options:
BullMQ Pro version 5.2.0
Help would be very much appreciated 🙏
P.S: This might or might not be a duplicate of https://github.com/taskforcesh/bullmq/issues/1662 however we are using BullMQ Pro and the Groups feature, so I decided to open a new report here instead.