Open salmonsteak1 opened 1 day ago
I've run into this as well. We implemented a query to find them, and then try to requeue them. But it seems we've not done this well, and it's turned into a bit of a mess. It seems like when the process is terminated, the jobs should be thrown back onto the queue to be attempted again.
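Roughly, the finding part looks like this (a rough sketch, not what we run verbatim; the model and association names are taken from solid_queue's schema, so double-check them against your version):

```ruby
# Executions still claimed by a worker process that no longer exists.
# SolidQueue::ClaimedExecution belongs_to :process in solid_queue's schema;
# verify the association name before relying on this.
orphaned = SolidQueue::ClaimedExecution.where.missing(:process)

orphaned.find_each do |execution|
  # Requeueing these cleanly is the part we haven't got right yet.
  Rails.logger.info "Stuck job #{execution.job_id}, claimed by process #{execution.process_id}"
end
```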
@salmonsteak1, oh yes! I forgot to update the README for https://github.com/rails/solid_queue/pull/277. In-flight jobs whose worker terminates abnormally will now be marked as failed, just as if the job itself had failed, so you can inspect them and retry/discard/fix them manually.
The reason is that the worker might be crashing because of the job itself. For example, a job with a memory leak might cause a monitoring process to kill the worker over its memory consumption. If we simply put the job back into the queue when the process is killed, another worker will pick it up just to be killed in turn, and so on. I need to update the README.
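A minimal console sketch of the inspect-and-retry flow, assuming `SolidQueue::FailedExecution` and its `retry` method are available in the version you're running (Mission Control does the equivalent through its UI):

```ruby
# Look at each failed execution's stored error before deciding what to do.
SolidQueue::FailedExecution.find_each do |failed|
  puts "#{failed.job_id}: #{failed.error}"
  failed.retry # re-enqueues the job; skip this for jobs you'd rather discard or fix first
end
```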
@wflanagan you can use Mission Control to inspect these jobs and retry them manually. If you're getting these often/regularly, something must be up with your setup, as it should be an exceptional situation. Regular deploys wouldn't cause this, provided workers get a small shutdown window to terminate orderly: in-flight jobs whose workers terminate orderly but don't have time to finish them are put back into the queue.
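On the deploy side, the relevant knob is the shutdown timeout; a small sketch (the option name is from my reading of the README's configuration section, so verify it against the version you run):

```ruby
# config/environments/production.rb (sketch)
Rails.application.configure do
  # How long workers get to finish in-flight jobs on SIGTERM before the
  # supervisor stops them forcefully. Your orchestrator's grace period
  # (e.g. Kubernetes' terminationGracePeriodSeconds) should exceed this.
  config.solid_queue.shutdown_timeout = 10.seconds
end
```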
Hey there, the README states this:
From what I understand, this means that when the plug is pulled or the worker is terminated unexpectedly, jobs that were still being processed by that worker will be released back to their queue the next time the supervisor checks for and prunes processes with expired heartbeats.
However, it seems like I'm getting these `ProcessPrunedError`s. I believe this happens in the event of the "plug being pulled" (or something similar, like forcefully shutting down a Kubernetes pod): the supervisor will also be killed and the job will be stuck with the worker. I'm just not sure when/where this `ProcessPrunedError` is being generated, and whether there's anything we can do for these jobs to be re-enqueued.
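For context, this is roughly how I'm spotting them (a rough sketch; the model name comes from solid_queue's schema, and matching on the error text is just an assumption about how the failure gets recorded):

```ruby
# Failed executions whose stored error mentions ProcessPrunedError.
# The error column holds the serialized exception, so a LIKE match is a
# crude but workable filter; verify the column/format in your version.
pruned = SolidQueue::FailedExecution.where("error LIKE ?", "%ProcessPrunedError%")
pruned.find_each { |failed| puts "#{failed.job_id}: #{failed.error}" }
```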