spkenv / spk

A Package Manager for high velocity software environments, built on spfs.
https://spkenv.dev
Apache License 2.0

Add a heartbeat between spfs-monitor and spfs-fuse #1111

Closed: jrray closed this 2 months ago

jrray commented 3 months ago

As mentioned in #895, there are cases where spfs-fuse never shuts down. An easy way to repro this is to kill -9 the spfs-monitor process, so it doesn't get a chance to clean up the runtime normally.

We observe spfs-fuse processes accumulating over time on our CI runners. This could come from users canceling pipelines (maybe the runner does a kill -9 of all the child processes?). It is also extremely common for zombie runtimes to accumulate over time, and not just on CI runners. Perhaps spfs-monitor is frequently crashing or failing to clean up?

This started as a proof of concept with the heartbeat interval and timeout hard coded. Enabling the heartbeat and tuning its timing should be configurable. Update: configuration added!

jrray commented 3 months ago

I think the biggest concern is linking the lifetime of spfs-fuse to spfs-monitor. At face value that's what we want in order to solve the zombie spfs-fuse problem, but if spfs-monitor crashes unexpectedly while the runtime is still in use, the fuse filesystem would shut down prematurely.

If spfs-monitor is crashing that's a serious problem that needs to be sorted out. This heartbeat link may be dangerous until that happens. But is it crashing, or is it getting killed? This is unclear. If it is getting killed (with -9) because a job is killed in CI or on the farm or whatever, then before this change spfs-fuse would keep running forever, and after this change it would terminate.
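
To make the before/after behavior concrete, here is a minimal sketch of the timeout side of a heartbeat. It is not the code in this PR: the HeartbeatWatchdog type, its method names, and the 5 second timeout in the usage note are all hypothetical, chosen only to illustrate the idea. The filesystem process records the time of the last heartbeat and treats itself as orphaned once no heartbeat has arrived within the timeout.

```rust
use std::time::{Duration, Instant};

/// Hypothetical watchdog state kept by the filesystem process.
/// Names and structure are illustrative assumptions, not the spfs API.
pub struct HeartbeatWatchdog {
    /// When the last heartbeat from the monitor was seen.
    last_seen: Instant,
    /// How long to wait without a heartbeat before giving up.
    timeout: Duration,
}

impl HeartbeatWatchdog {
    /// Start the watchdog, treating `now` as the first heartbeat.
    pub fn new(now: Instant, timeout: Duration) -> Self {
        Self { last_seen: now, timeout }
    }

    /// Record a heartbeat received from the monitor at time `now`.
    pub fn beat(&mut self, now: Instant) {
        self.last_seen = now;
    }

    /// Returns true once more than `timeout` has elapsed since the
    /// last recorded heartbeat, meaning the monitor is presumed gone.
    pub fn is_expired(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_seen) > self.timeout
    }
}
```

With a 5 second timeout, a monitor killed with -9 simply stops calling beat, and is_expired starts returning true once more than 5 seconds have passed since the last beat, at which point the filesystem can unmount and exit. The same property is what makes the link risky: a monitor that crashes while the runtime is still in use looks identical from the filesystem's point of view.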