troglobit / finit

Fast init for Linux. Cookies included
https://troglobit.com/projects/finit/
MIT License
622 stars 61 forks source link

Fix shutdown hooks not running #338

Closed JackNewman12 closed 1 year ago

JackNewman12 commented 1 year ago

The shutdown hooks do not have any registered callbacks. So calling plugin_run_hooks() ends up doing nothing.

I think it makes sense to handle these special case hook scripts like this, but not sure if skipping some of the extra checks in plugin_run_hooks() makes a difference. I am guessing we don't care about the cond_set() and service_step() calls when everything has already exited.

troglobit commented 1 year ago

Curious, when I recently implemented this I made sure to test all these cases, and it still works for me without your change. There's a piece of code to ensure scripts are called this late (or early at bootstrap):

https://github.com/troglobit/finit/blob/910dac5f3a446b3ce44a1031c9f48cf079349f62/src/plugin.c#L243-L252

I wanted to ensure all hooks are called using the same API, in case of future changes. Having different entry points is hard to manage over time.

However, I'm inclined to pull in your patch anyway. Perhaps even extend upon it to do the same for the early hook points as well, for symmetry. But first I'm very curious to understand why the current code doesn't work in your case -- feels like there's something else hidden here.

JackNewman12 commented 1 year ago

Just had a random thought about this and will have to check on Monday. I bet I mindlessly added a /var/run/finit/cond/usr folder on the root filesystem as part of the build. I mkdir this folder at startup since finit didn't seem to create it for me and it is required to get usr conditions working (TBD if this is fixed now).

This probably goes unnoticed since /var gets mounted over it immediately (and then the startup scripts run) so everything continues as normal. Hence once all the filesystems are unmounted, that cond_is_available() check doesn't work.

If my hunch is correct it's probably a my-bad, although I'd argue that a bad folder structure, or failure to unmount /var shouldn't stop the shutdown scripts from running.

troglobit commented 1 year ago

Ah, yeah I think I ran into the same issue. Both usr and sys condition sub-directories were missing in some cases, seems to have been the root cause of the failing auto tests.

The /var/run (or just /run these days) will be given a ramdisk by Finit if it's not already mounted by the user in /etc/fstab. The order of the plugins that create all sub-directories is important, but the order had been lost. Commits 8bbae3d and c22d123 (yesterday) should have it covered. Sorry about that!

JackNewman12 commented 1 year ago

Ahh. So /var/run gets unmounted. /var fails to umount for whatever reason. I can't see anything in lsof and I removed all the kernel modules, TBD why this happens. If I do a umount /var/run before reboot the /var umount is successful. 🤷‍♂️ Beyond me.

If I manually umount /var/run I can see the underlying /var has a hook symlink?

# ls -lha /var/run/finit/cond/hook/mount/
lrwxrwxrwx    1      26 Jan  1  1970 post -> /var/run/finit/cond/reconf

(this is the only file in /var/run/**)

I am guessing this file is getting created before /var/run gets mounted and so it ends up in /var? Hence the cond_is_available() check is always true during a shutdown, even after unmounting /var/run. Is the mount() command blocking? Is this race condition between the mount and this symlink getting created?

Note adding the commits https://github.com/troglobit/finit/commit/8bbae3d414b066766d1ee88e1522517955fd1ed9 and https://github.com/troglobit/finit/commit/c22d123ab07a77e09bd6aab233907299f7a36735 mentioned above don't fix this.

I would guess https://github.com/troglobit/finit/blob/02a110e6c26fc6e96b2eeb8efba2d777eb186946/src/plugin.c#L247-L252 needs a return in it, or https://github.com/troglobit/finit/blob/02a110e6c26fc6e96b2eeb8efba2d777eb186946/src/plugin.c#L261-L266 needs to check if cond_is_available() so that this symlink doesn't get created by cond_set_oneshot(). But making those changes will stop /var/run getting mounted in the first place.

Its a bit above my paygrade now. But either way having a umount failure shouldn't stop the shutdown scripts from running.

troglobit commented 1 year ago

Awesome, that's exactly what I was looking for, thanks! Explains a lot that this happens on a system where /var/run is not a symlink to /run, all my own systems have migrated to a separate RAM disk on /run, which Finit mounts automatically if missing from /etc/fstab.

Yeah you're right, I'll have to look into those places, and build on this PR as well to improve robustness at shutdown, as well as boot (in case someone starts with a messed up rootsfs ...)

Thank you so much for helping find the root cause!

troglobit commented 1 year ago

There, should work better now. I've adjusted the boundaries for calling cond_set_oneshot(), and also added a check for cond_is_available(), just like you suggested.

Turns out cond_set_oneshot() calls a couple of intermediate functions that in turn do a cond_checkpath(), which creates any missing directory, including /var/run/finit, if missing! So that was the root of it all.

Great to find and nail this one down before the next release!

JackNewman12 commented 1 year ago

Does that change work for you? I made the same change locally and while it fixed the symlink issue it caused /var/run to never get mounted by finit. (It's not in my fstab)

troglobit commented 1 year ago

What does your filesystem layout look like? The recommended one is "new" style dedicated mount point for /run and a compat symlink back to it, like this /var/run -> ../run.

Finit only mounts /run, look here:

https://github.com/troglobit/finit/blob/02a110e6c26fc6e96b2eeb8efba2d777eb186946/src/finit.c#L390-L405

troglobit commented 1 year ago

Updated the docs a bit to detail the fs recommendations. Hope it helps:

https://github.com/troglobit/finit/tree/master/doc#filesystem-layout

JackNewman12 commented 1 year ago

Found it. Looks like a few years ago whoever created the /var/run -> /run symlink actually created it as /run -> /var/run in our system. 🤦‍♂️ I guess having finit accidently create those /var/run/finit/cond ... directories was then letting it correctly mount /run later.

Swapped the symlink around and now everything mounts as expected. Cheers

troglobit commented 1 year ago

Nice find, and excellent news!

Yup, that sounds very likely from what I saw myself.

Glad to hear it all worked out 😃