troglobit / finit

Fast init for Linux. Cookies included
https://troglobit.com/projects/finit/
MIT License
633 stars, 64 forks

Rebooting permanently stalls if finit shows [WARN] for an app when attempting to kill #227

Closed hongkongkiwi closed 1 year ago

hongkongkiwi commented 2 years ago

When rebooting, if a service has a "[WARN]" status the reboot never completes.

I was testing killing an app using kill <pid> and having finit restart the app. finit doesn't seem to pick up the correct PID when this happens; see bug #226.

When this situation happens, I guess that finit gets "out of sync", so when doing: finit 6 to reboot, it stalls on the above app:

# finit 6
[FAIL] Saving sound settings
[FAIL] Saving random seed
[ OK ] Stopping System log daemon
[ OK ] Stopping Kernel log daemon
[ OK ] Stopping Chrony Time daemon
[WARN] Killing MyApp Media daemon

By stall, I mean it sits forever on the [WARN] line.

In normal cases finit 6 works totally fine as long as it can kill this app, but any "[WARN]" line seems to halt the rebooting process permanently (no matter how long I wait).

hongkongkiwi commented 2 years ago

Just to show this is not a fluke, for some reason dbus had the [WARN] status and the same situation happened:

# finit 6
[FAIL] Saving sound settings
[ OK ] Saving random seed
[FAIL] Stopping D-Bus message bus daemon
[WARN] Killing D-Bus message bus daemon

It will halt at this condition forever.

troglobit commented 2 years ago

Interesting, I'll have a look at this in detail and try to set up a testcase for it. We just had a PR for shutdown/kill so there might be a regression.

Just to make sure, which version of Finit are you running; the latest release, or a GIT version? (The PR I mentioned above is not released yet.)

troglobit commented 2 years ago

Progress: so far I've only been able to reproduce the [WARN], but for me the system reboots fine. I'm starting to suspect it's not the stopping of services that's at fault, but rather something else. Could you try calling initctl debug before initctl reboot?

[ OK ] Stopping Web interface
[WARN] Killing Simple NTP daemon
[    9.157661] reboot: Restarting system

troglobit commented 2 years ago

So, the fix for this issue in 7dc7f9a handles the "stall" in reboot. The actual root cause, which you hinted at, really seems to be #226. See that issue for an update on that as well.

hongkongkiwi commented 2 years ago

Oh that's great, sorry I didn't get a debug log earlier. We are doing some system porting and I had to switch (temporarily) to another project. I'm really glad you were able to find the cause of this. We are on an embedded platform, so having it not behave as expected when shutting down was quite challenging.

This was a little bit inconsistent for me to replicate, but I'll try the latest version. Thanks for the fix!

troglobit commented 2 years ago

Yeah, I'm mostly on embedded systems as well, and reboot must always work. Hope it works better for you too :)

troglobit commented 1 year ago

Reopening, I just ran into this one myself trying to reboot and found the following:

...
finit[1]: service_kill():(null): Sending SIGKILL to process group 2577
finit[1]: Stopping pod:system[2577], sending SIGKILL ...
[WARN] Killing System container
...

After which everything just hung forever.

The interesting bit is the (null) above. It comes from an internal function that looks in /proc/2577/status for the actual process name. Here it could not find one, and the only way for that function to fail is if 2577 no longer exists!

Analysis

In my use case, pod:system[2577] is a podman container which, as it turns out, starts conmon to monitor the container. However, the PID 2577 that it returned in the container pidfile was for that system's init process, not conmon itself. conmon is a process monitor and sub-reaper, hence Finit never got any feedback to proceed, and the service_kill() function exited early, leaving Finit to wait forever ...