Open eternal-flame-AD opened 1 month ago
Same issue.
I had the same issue, stuff from e-komik.org through a relay stalled the whole queue. blocking the relay didn't work but blocking e-komik.org seems to have cleared it up.
Update: I read the source and could not find exactly where the issue was. However I see similar code paths to 'skip: LD-Signatureの検証に失敗しました' in today's log after I removed the offending relay and there was no stall. It seems to be something specific to the e-komik event.
Maybe add a hard timeout to the jobs, I think if one job takes more than 1 minute most instance owners would rather not receive it? ref: https://docs.bullmq.io/patterns/timeout-jobs
💡 Summary
"skip: LD-Signatureの検証に失敗しました" was marked as a "unrecoverableerror" but doesn't remove the message from inbox pipeline, stalling it.
🥰 Expected Behavior
If an inbox message can't be accepted, it should be removed from the pipeline immediately (or at least a retry that happens later in time) so as other messages don't get starved of workers.
🤬 Actual Behavior
I found today my instance has a >8h delay in receiving external notes. I went to see the Bull Dashboard and say 9000 pending inbox messages and growing. The CPU and memory utility was low. I investigated the running jobs and found they have been stalled for ~20 minutes. All show an error but are still in the "running state" (see reproduction section). Probably this stalled the pipeline causing the pile up.
Someone on misskey suggested it might be retrying behavior but (1) If it's a signature mismatch it should be permanent error (2) No job should take up a slot in the workers for 25 minutes without releasing it to someone else first.
Deleting the offending server alone did not resolve the issue (temporarily unblocks it but get's blocked again quickly), I need to manually purge the redis inbox of that relay for the problem to resolve.
Screenshot: https://mi.yumechi.jp/notes/9ydk0fspg51f000e (sorry I forgot to screenshot the "Active" tab, but it was the 16 jobs on the Active tab that got stalled.
📝 Steps to Reproduce
I don't know how to inject events manually into the queue so not sure how to exacly reproduce it again but I saved the event json, and the error on the Bull Dashboard. (I see an option to manually put in a JSON data+options but I don't know the underlying logic of misskey so wasn't comfortable just putting it in on my real instance, however I am will to try it out if someone on the team could give me a JSON to put in)
All activities stalled on the queue look like this (it's different events but same type and data structure, from the same relay).
Bulls dashboard shows this error:
docker and docker-compose logs for some reason stops working as soon as it's stalled ... i hope the above is enough information.
💻 Frontend Environment
🛰 Backend Environment (for server admin)
Do you want to address this bug yourself?
I don't have time looking at the source today but I am willing to submit a PR if it is within my capabilities.