rails / solid_queue

Database-backed Active Job backend

`SolidQueue::Processes::ProcessPrunedError` shown in "Failed jobs" and not being released back to their queues #422

Open salmonsteak1 opened 1 day ago

salmonsteak1 commented 1 day ago

Hey there, the README states this:

If processes have no chance of cleaning up before exiting (e.g. if someone pulls a cable somewhere), in-flight jobs might remain claimed by the processes executing them. Processes send heartbeats, and the supervisor checks and prunes processes with expired heartbeats, which will release any claimed jobs back to their queues. You can configure both the frequency of heartbeats and the threshold to consider a process dead. See the section below for this.

From what I understand, this means that when the plug is pulled or the worker is terminated unexpectedly, jobs still being processed by that worker will be released back to their queue the next time the supervisor checks for and prunes processes with expired heartbeats.

However, it seems like I'm getting these ProcessPrunedError failures:

[Screenshots: failed jobs showing SolidQueue::Processes::ProcessPrunedError]

I believe this happens in the event of the "plug being pulled" (or something similar, like forcefully shutting down a Kubernetes pod): the supervisor gets killed along with the worker, and the job stays stuck with that worker. I'm just not sure when/where this ProcessPrunedError is being generated, and whether there's anything we can do to get the job re-enqueued.
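For reference, the heartbeat frequency and the dead-process threshold mentioned in the README quote above are regular configuration options. A minimal sketch, assuming the option names documented in the README (`process_heartbeat_interval` and `process_alive_threshold`; the values here are just illustrative):

```ruby
# config/environments/production.rb
Rails.application.configure do
  # How often each Solid Queue process records a heartbeat.
  config.solid_queue.process_heartbeat_interval = 60.seconds

  # How long without a heartbeat before the supervisor considers a process
  # dead and prunes it.
  config.solid_queue.process_alive_threshold = 5.minutes
end
```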

wflanagan commented 10 hours ago

I've run into this as well. We implemented a query to find these jobs and then try to requeue them, but it seems we haven't done this well and it's turned into a bit of a mess. It seems like, when the process is terminated, the jobs should be put back on the queue to be attempted again.
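Something along these lines — a rough sketch of the idea only, assuming the failed executions table keeps the exception details in its serialized `error` column and that `SolidQueue::FailedExecution` exposes a `retry` method:

```ruby
# Rough sketch: find jobs that failed because their worker was pruned and
# send them back to their queues. Assumes the exception class appears in the
# failed execution's serialized `error` column and that
# SolidQueue::FailedExecution responds to #retry.
SolidQueue::FailedExecution
  .where("error LIKE ?", "%ProcessPrunedError%")
  .find_each(&:retry)
```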

rosa commented 9 hours ago

@salmonsteak1, oh yes! I forgot to update the README after https://github.com/rails/solid_queue/pull/277. In-flight jobs whose worker terminates abnormally will now be marked as failed, just as if the job itself had failed, so you can inspect them and retry/discard/fix them manually.

The reason is that the worker might be crashing because of the job itself. For example, a job with a memory leak might cause some monitoring process to kill the worker over its memory consumption. If we simply put the job back into the queue when the process is killed, another worker would pick it up just to get killed as well, and so on. I need to update the README.

@wflanagan you can use Mission Control to inspect these jobs and retry them manually. If you're getting these often or regularly, something must be up with your setup, as this should be an exceptional situation. Regular deploys implemented with a small shutdown window, in which workers have time to terminate orderly, would not cause this: in-flight jobs whose workers terminate orderly but didn't have time to finish them will be put back into the queue.
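For the deploy side, a minimal sketch of what "a small shutdown window" looks like in config, assuming the `shutdown_timeout` option documented in the README (the value is illustrative, not a recommendation):

```ruby
# config/environments/production.rb
Rails.application.configure do
  # After SIGTERM, the supervisor gives workers this long to finish their
  # in-flight jobs before forcing them to stop. Jobs a worker hands back
  # during an orderly shutdown are put back into their queues.
  config.solid_queue.shutdown_timeout = 10.seconds
end

# If you run on Kubernetes, make sure the pod's terminationGracePeriodSeconds
# is comfortably larger than this value so workers actually get that window.
```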