project8 / dragonfly

a dripline-based slow-control implementation

subprocesses failing quietly #205

Open wcpettus opened 5 years ago

wcpettus commented 5 years ago

When a subprocess exits, there is no generic notification to the operator. That isn't necessarily a bad design, but in practice most of our subprocesses are not intended to be limited in duration, so this creates a trap: it can take a long time to notice that the subprocess we cared about isn't behaving as designed.

Typical design:

Things that could be done: either make the restart more automatic (a) or make the crash more obvious (b).

(a) Probably the easiest route is to modify the subprocess mixin's basic_control_target (or add a continuous-control version of it) so that, when the is_alive check fails, it restarts the worker (see the sketch below).
(b) The strongest method would be to override the ping functionality so it checks whether the worker is alive (this only works for a single level of worker); the cleanup method could also be made to emit a lot more errors.
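A minimal sketch of what option (a) could look like, assuming the worker is a multiprocessing.Process; the names (ContinuousSubprocessMixin, _make_worker, run_forever) are invented for illustration and don't reflect the actual subprocess mixin:

```python
import logging
import multiprocessing
import time

logger = logging.getLogger(__name__)


class ContinuousSubprocessMixin(object):
    """Keep a worker process running; restart it whenever it dies.

    Illustrative only -- not dragonfly's actual subprocess mixin.
    """

    def _worker_target(self):
        # placeholder for the real work loop a subclass would provide
        while True:
            time.sleep(1)

    def _make_worker(self):
        # a Process object can't be start()ed twice, so build a fresh one
        # each time we (re)start the worker
        return multiprocessing.Process(target=self._worker_target)

    def run_forever(self, poll_interval=5.0):
        worker = self._make_worker()
        worker.start()
        while True:
            time.sleep(poll_interval)
            if not worker.is_alive():
                # option (a): restart automatically instead of failing quietly
                logger.warning("worker exited with code %s; restarting",
                               worker.exitcode)
                worker = self._make_worker()
                worker.start()
```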

laroque commented 5 years ago

I'd have to go read up on the subprocess API, but it should be possible to check if the subprocess exited cleanly or with an error. Around subprocess_mixin.py#L70 (linked above), I'd think we could check/catch any dripline exceptions from the subprocess and raise them. This would, in principle, kill the service, ideally producing a p8_alert, and allow the process manager to restart the process.
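Without subprocess_mixin.py open here, a rough sketch of that check, assuming the worker is held as a multiprocessing.Process; SubprocessDied stands in for whatever dripline exception the real code would raise:

```python
class SubprocessDied(RuntimeError):
    """Stand-in for whatever dripline exception the real code would raise."""


def check_worker(worker):
    """Raise if the managed worker (a multiprocessing.Process) has exited."""
    if not worker.is_alive():
        # Process.exitcode is 0 for a clean exit, non-zero for an error,
        # and negative if the worker was killed by a signal
        raise SubprocessDied("worker {} exited with code {}".format(
            worker.name, worker.exitcode))
```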

I think it makes sense that in most cases, if the subprocess that was supposed to be running fails, we should exit the program and let it be restarted, rather than doing lots of in-process exception handling and restarting. (This is possibly different from the desired behavior when responding to a discrete dripline request, where it makes sense to catch dripline errors and pass them back in the reply while letting the service keep running.) Basically, I think it is easier to let the process die and be restarted in a known clean state (everything is stateless, so it should come up fine, right?) than to try to handle exceptions, clean things up, restore some reasonably clean state, and then go through a restart process.
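As a sketch of that fail-fast behaviour (names invented, not dragonfly's actual entry point), the top level would just let the exception escape and exit non-zero so the process manager handles the restart:

```python
import logging
import sys

logger = logging.getLogger(__name__)


def run_until_failure(service):
    """Run a blocking service; if it dies, exit non-zero and let the process
    manager (systemd, docker, ...) restart it from a clean state."""
    try:
        service.run()  # expected to raise if the monitored worker exits
    except Exception:
        logger.exception("service terminating after unhandled error")
        sys.exit(1)
```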

An added bonus is that if the failure is due to something external to the program, the process manager can see the repeated crashes, apply a backoff, and clearly show a failed state. If instead the service is stuck in an internal loop where it keeps trying and failing to restart the worker, we may still get slack alarms (that could happen in both designs), but to the process manager the service itself looks fine.