Closed jrray closed 2 months ago
I think the biggest concern is linking the lifetime of spfs-fuse to spfs-monitor. At face value that's what we want in order to solve the zombie spfs-fuse problem, but if spfs-monitor crashes unexpectedly while the runtime is still in use, then the fuse filesystem would shut down prematurely.
If spfs-monitor is crashing that's a serious problem that needs to be sorted out. This heartbeat link may be dangerous until that happens. But is it crashing, or is it getting killed? This is unclear. If it is getting killed (with -9) because a job is killed in CI or on the farm or whatever, then before this change spfs-fuse would keep running forever, and after this change it would terminate.
As mentioned in #895, there are cases where spfs-fuse never shuts down. An easy way to reproduce this is to `kill -9` the spfs-monitor process so that it doesn't get a chance to clean up the runtime normally.

We observe spfs-fuse processes accumulating over time on our CI runners. This could come from users canceling pipelines (maybe the runner does a `kill -9` of all the child processes?). It is also extremely common for zombie runtimes to accumulate over time (not just on CI runners); perhaps spfs-monitor is commonly crashing or failing to clean up?

This is a proof of concept. Enabling it, and tuning the heartbeat interval and timeout, should be configurable, but these are hard-coded for now.

Configuration added!
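
For readers unfamiliar with the idea, here is a minimal sketch of the heartbeat timeout logic discussed in this PR. It is not the code from this change: `HeartbeatConfig`, `Watchdog`, and every field and method name below are hypothetical, chosen only for illustration. The monitor side would be expected to call `record_heartbeat` periodically, and the fuse side would poll `should_shut_down`.

```rust
use std::time::{Duration, Instant};

/// Hypothetical settings for the heartbeat link. These names are
/// illustrative and are not the actual spfs configuration keys.
#[derive(Debug, Clone, Copy)]
pub struct HeartbeatConfig {
    /// When false, the filesystem never shuts down due to missed heartbeats.
    pub enabled: bool,
    /// How often the monitor is expected to send a heartbeat.
    pub interval: Duration,
    /// How long the filesystem waits without a heartbeat before giving up.
    pub grace: Duration,
}

/// Tracks the most recent heartbeat seen from the monitor process.
#[derive(Debug)]
pub struct Watchdog {
    config: HeartbeatConfig,
    last_seen: Instant,
}

impl Watchdog {
    /// Start the watchdog, treating `now` as the first heartbeat.
    pub fn new(config: HeartbeatConfig, now: Instant) -> Self {
        Watchdog { config, last_seen: now }
    }

    /// Record that a heartbeat arrived at `now`.
    pub fn record_heartbeat(&mut self, now: Instant) {
        self.last_seen = now;
    }

    /// True when the feature is enabled and no heartbeat has arrived
    /// for longer than the grace period.
    pub fn should_shut_down(&self, now: Instant) -> bool {
        self.config.enabled
            && now.saturating_duration_since(self.last_seen) > self.config.grace
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside the methods, keeps the timeout decision easy to unit test. Setting the grace period to several multiples of the interval is one way to tolerate a few delayed heartbeats before the filesystem shuts down.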