rails / solid_queue

Database-backed Active Job backend
MIT License
1.95k stars 130 forks source link

Handle scenarios where registered processes disappear outside regular flow #337

Closed rosa closed 2 months ago

rosa commented 2 months ago

This is a simple way to mitigate the effects of a computer going to sleep and then coming back to life, with processes with expired heartbeats that are, however, alive. It came up as part of the quest to reproduce another unrelated error in #324. It's not trivial for the supervisor to know whether a process is actually alive or not because the pid might have been reassigned by the OS to a different process. It can't rely on its registered supervisees either because we might be running multiple supervisors. We could possibly check its current forks, whether the pid matches, and in that case, skip the pruning, but I'm wary about introducing complicated logic there because the risk is ending up with dead registered processes and locked jobs.

Since this scenario should happen only in development, I opted for keeping the pruning logic the same (except from preventing the supervisor from pruning itself, as in that case, the supervisor running the pruning is very much alive and running) and just have each process check whether their registered process record is gone when they heartbeat. If it's gone, just stop as if a TERM or INT signal had been received. If that process was a supervised fork, its supervisor will replace it as soon as it realises it's gone.

This might be handy in the future as well, to stop a given worker from Mission Control.

Big thanks to @npezza93 for catching this scenario.

npezza93 commented 2 months ago

If i manually go into the db and trash all the processes i get this error:

image

supervisor.rb

        if registered_process = process&.supervisees&.find_by(name: terminated_fork.name)

Adding this patch fixes the error but a new supervisor never gets restarted.

rosa commented 2 months ago

If i manually go into the db and trash all the processes i get this error:

Ahhh! Yes, I was only considering the case where the processes get deregistered because they lost their heartbeats, not the case where you go and delete them all. I made sure the supervisor would never get deregistered by itself, so it should work... it might be the case that you are running multiple supervisors and one of them deregisters the other but I think you wouldn't do that in development (which is the case I'm handling here, as you'd expect things to break in production if suddenly the server running your jobs decides to go to sleep) 🤔

npezza93 commented 2 months ago

Ahhh! Yes, I was only considering the case where the processes get deregistered because they lost their heartbeats, not the case where you go and delete them all. I made sure the supervisor would never get deregistered by itself, so it should work... it might be the case that you are running multiple supervisors and one of them deregisters the other but I think you wouldn't do that in development (which is the case I'm handling here, as you'd expect things to break in production if suddenly the server running your jobs decides to go to sleep) 🤔

Thats a good point. i was just impatient and decided to nuke the records instead waiting for my mac to hibernate. Going the hibernation route this never gets raised so i think you can ignore more here

rosa commented 2 months ago

We're now running this in production in HEY to make sure everything is ok.