habitat-sh / habitat

Modern applications with built-in automation
https://www.habitat.sh
Apache License 2.0

Reconsider how we determine if a process is "alive" on Linux #6131

Open christophermaier opened 5 years ago

christophermaier commented 5 years ago

Our is_alive function is essentially a fancy version of libc::kill. As such, it cannot distinguish between a process that exists and is alive, and a process that exists but is a zombie.
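To make the limitation concrete, here is a minimal sketch of a signal-0 existence check. It is not the actual is_alive code: it assumes the check boils down to `kill(pid, 0)`, `exists_via_kill` and `zombie_demo` are hypothetical names, `kill` is declared by hand instead of coming from the libc crate, and the demo assumes a `true` binary is on the PATH.

```rust
use std::io;
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

// Declared by hand so the sketch needs no crates; `libc::kill` has the same shape.
extern "C" {
    fn kill(pid: i32, sig: i32) -> i32;
}

const EPERM: i32 = 1; // errno value on Linux

/// Hypothetical stand-in for `is_alive`. Signal 0 delivers nothing; the kernel
/// only checks that the target exists and that we are allowed to signal it.
fn exists_via_kill(pid: i32) -> bool {
    if unsafe { kill(pid, 0) } == 0 {
        return true;
    }
    // EPERM still means "exists", just not signalable by us (an assumption
    // about how the real function treats this case).
    io::Error::last_os_error().raw_os_error() == Some(EPERM)
}

/// Returns what the check reports (before reaping, after reaping) for a
/// child that has most likely already exited.
fn zombie_demo() -> io::Result<(bool, bool)> {
    let mut child = Command::new("true").spawn()?;
    let pid = child.id() as i32;
    // Give `true` time to exit. Nobody has waited on it, so it stays a zombie.
    sleep(Duration::from_millis(200));
    let before = exists_via_kill(pid);
    child.wait()?; // reap the zombie
    let after = exists_via_kill(pid);
    Ok((before, after))
}
```

On Linux, `zombie_demo` should report the child as existing until it is reaped, even though the child has finished running.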

With the Supervisor taking on responsibility for shutting down services in #6107, this distinction becomes more important. Since the Launcher is still the parent of service processes, it needs to reap the service processes when they exit. If it doesn't, the Supervisor will still think they're "alive".

Normally, this isn't a problem because the Launcher regularly reaps its child processes. When the Supervisor is shutting down, however, the Launcher needs to continue reaping children as it waits for the Supervisor process itself to shut down. If not, the Supervisor will wait its allotted 8-second timeout for the service processes to become "not alive" and will then send a KILL signal. This can delay shutdown unnecessarily.
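For illustration, non-blocking reaping of the kind described here could look roughly like the following. This is a sketch, not the Launcher's actual code: `reap_exited_children` is a hypothetical name, and `waitpid` and `WNOHANG` are declared by hand with their Linux definitions rather than taken from the libc crate.

```rust
// Declared by hand so the sketch needs no crates; `libc::waitpid` has the same shape.
extern "C" {
    fn waitpid(pid: i32, status: *mut i32, options: i32) -> i32;
}

const WNOHANG: i32 = 1; // value on Linux

/// Hypothetical helper: reap every child that has already exited, without
/// blocking, and return the PIDs that were reaped.
fn reap_exited_children() -> Vec<i32> {
    let mut reaped = Vec::new();
    loop {
        let mut status: i32 = 0;
        // -1 means "any child"; WNOHANG makes the call return immediately.
        let pid = unsafe { waitpid(-1, &mut status, WNOHANG) };
        if pid <= 0 {
            // 0: children exist but none has exited yet.
            // -1: no children left (ECHILD) or another error.
            break;
        }
        reaped.push(pid);
    }
    reaped
}
```

Calling something like this on every iteration of the loop that waits for the Supervisor would keep exited services from lingering as zombies during shutdown.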

This works, but it means, among other things, that is_alive is not very well named 😄

We can likely refactor is_alive on Linux to leverage the procinfo crate to distinguish between truly alive processes and zombies. One important wrinkle, however, is that we often call is_alive with a negative PID, which queries everything in the process group, rather than just a single process. If we were to use procinfo, we would need to handle the querying of process group members on our own.
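As a rough sketch of what that could involve, the following reads `/proc/<pid>/stat` directly with the standard library instead of going through procinfo (whose API is not shown here). All function names are hypothetical, and a PID of 0 is simply treated as "not alive" to keep the example short.

```rust
use std::fs;

/// Extract (state, process group ID) from the contents of `/proc/<pid>/stat`.
fn parse_stat(stat: &str) -> Option<(char, i32)> {
    // The command name is wrapped in parentheses and may itself contain
    // spaces or ')', so split on the last ')'.
    let rest = &stat[stat.rfind(')')? + 1..];
    let mut fields = rest.split_whitespace();
    let state = fields.next()?.chars().next()?;
    let _ppid = fields.next()?;
    let pgrp = fields.next()?.parse().ok()?;
    Some((state, pgrp))
}

fn read_stat(pid: i32) -> Option<(char, i32)> {
    parse_stat(&fs::read_to_string(format!("/proc/{}/stat", pid)).ok()?)
}

/// 'Z' is a zombie and 'X' is dead; every other state counts as alive here.
fn is_live_state(state: char) -> bool {
    !matches!(state, 'Z' | 'X')
}

/// Hypothetical replacement for `is_alive`: a positive PID checks one
/// process, a negative PID checks every member of process group `-pid`.
fn is_alive_sketch(pid: i32) -> bool {
    if pid > 0 {
        return read_stat(pid).map_or(false, |(state, _)| is_live_state(state));
    }
    let pgid = match pid.checked_neg() {
        Some(p) if p > 0 => p,
        _ => return false, // pid == 0 (or overflow) is not handled in this sketch
    };
    let entries = match fs::read_dir("/proc") {
        Ok(entries) => entries,
        Err(_) => return false,
    };
    // Scan the numeric entries in /proc for a live member of the group.
    entries
        .flatten()
        .filter_map(|e| e.file_name().to_str()?.parse::<i32>().ok())
        .filter_map(read_stat)
        .any(|(state, pgrp)| pgrp == pgid && is_live_state(state))
}
```

Scanning /proc like this is inherently racy, since processes can appear or exit mid-scan, so any real refactoring would have to decide how much that matters for the shutdown path.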

See the discussion that spawned all this for further background.

There may be other uses of is_alive that need to be taken into account with any refactoring that takes place.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.
