spfs-fuse process can get left behind

rydrman commented 11 months ago

This is a bit of an edge case, but I wanted it documented nonetheless.

We had a host which was failing to start an spfs shell upon login via ssh. In this case the user would see this error (overlayfs + fuse):

WARN FUSE did not appear to start after delay: Connection refused (os error 111)
mount: mount none on /spfs failed: Connection refused
ERROR Failed to mount overlayfs

A look at the system journal would show this for spfs-fuse:

spfs[4439]:  INFO Filesystem initialized
spfs[4439]:  WARN Request RequestId(1): Failed to send reply: Invalid argument (os error 22)

In this case, spfs-fuse was running but spfs-enter failed because overlayfs couldn't be mounted. This meant that the monitor was never started and the spfs-fuse process would stick around forever.

I was not able to identify the underlying fuse issue, and rebooting the machine resolved the mount error, so we moved on.

This issue is meant to track the failure state and to find a way for these partial mounts to still be properly torn down on failure.

jrray commented 11 months ago

Perhaps we want to add some responsibility to spfs-enter to "tickle" a magic file in the fuse filesystem to let it know that everything got set up properly. Without that signal, spfs-fuse would shut itself down after a short grace period.

You can try to handle all the error cases and shut down spfs-fuse if something went wrong, but it is always possible for the thing that was supposed to do the cleanup to crash or be killed before it gets a chance.
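A minimal sketch of that grace-period idea, using only the Rust standard library. The names here (SetupConfirmed, wait_for_confirmation) are hypothetical and are not taken from the spfs code base. The flag stands in for whatever the fuse filesystem would set when spfs-enter touches the magic file, and the caller would shut spfs-fuse down if the function returns false.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical flag that the filesystem would set once spfs-enter
/// signals that the rest of the runtime was set up successfully.
#[derive(Clone, Default)]
pub struct SetupConfirmed(Arc<AtomicBool>);

impl SetupConfirmed {
    /// Called from the (assumed) handler for the "magic file" access.
    pub fn confirm(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_confirmed(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Polls the flag until it is set or the grace period elapses.
/// Returns true if setup was confirmed in time; on false the caller
/// would be expected to unmount and exit.
pub fn wait_for_confirmation(flag: &SetupConfirmed, grace: Duration, poll: Duration) -> bool {
    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        if flag.is_confirmed() {
            return true;
        }
        thread::sleep(poll);
    }
    // One last check in case the confirmation landed during the final sleep.
    flag.is_confirmed()
}
```

The point of putting the deadline inside spfs-fuse itself is that it does not depend on any other process surviving long enough to do the cleanup.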

rydrman commented 11 months ago

I'm thinking of something inconspicuous, like reading or setting an extended attribute on the root of the mount.

jrray commented 1 month ago

https://github.com/spkenv/spk/blob/8151a88667f9fd4968ddfdaa5a93616a28c22c07/crates/spfs-cli/cmd-fuse/src/cmd_fuse.rs#L272-L275

FWIW, I discovered in the documentation for abort that it doesn't work on join handles returned from spawn_blocking. Despite the comment attached here, with abort being a no-op, we really weren't doing any kind of cleanup here. In my testing, using fuser's unmount method does nothing, and there's no way to signal fuser::Session::run's loop to terminate.

I've implemented a heartbeat connection between spfs-monitor and spfs-fuse, as suggested above (and recently in slack). Before adding this heartbeat, it was easy to reproduce an spfs-fuse process hanging around forever by kill -9'ing the related spfs-monitor process. With the heartbeat in place, spfs-fuse will eventually time out and exit.
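For readers who want the general shape of such a heartbeat timeout, here is a rough sketch using only the standard library. It is not the actual spfs implementation: the in-process channel is an assumed stand-in for the real connection between spfs-monitor and spfs-fuse, and watch_heartbeat and HeartbeatEnd are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why the watcher stopped waiting for heartbeats.
#[derive(Debug, PartialEq, Eq)]
pub enum HeartbeatEnd {
    /// The sender is still connected but went silent for too long.
    TimedOut,
    /// The sending side went away entirely.
    Disconnected,
}

/// Blocks for as long as heartbeats keep arriving within `timeout` of
/// each other. Returns how many beats were seen and why it stopped; in
/// either case the caller (the fuse side in this sketch) would exit.
pub fn watch_heartbeat(rx: &Receiver<()>, timeout: Duration) -> (usize, HeartbeatEnd) {
    let mut beats = 0;
    loop {
        match rx.recv_timeout(timeout) {
            Ok(()) => beats += 1,
            Err(RecvTimeoutError::Timeout) => return (beats, HeartbeatEnd::TimedOut),
            Err(RecvTimeoutError::Disconnected) => return (beats, HeartbeatEnd::Disconnected),
        }
    }
}
```

Because the receiving side owns the timeout, a monitor that is killed with kill -9 and never sends another beat still leads to the watcher returning.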

spkenv / spk

spfs-fuse process can get left behind #895