project8 / dragonfly

a dripline-based slow-control implementation

subprocesses failing quietly #205

Open wcpettus opened 5 years ago

wcpettus commented 5 years ago

When a subprocess exits, there is no generic notification to the operator. That isn't necessarily a bad design, but in practice most of our subprocesses are not intended to be limited in duration, so this creates a trap: it can take a long time to notice that the subprocess we cared about isn't behaving as designed.

Typical design:

Things that could be done: either make the restart more automatic (a) or make the crash more obvious (b).

(a) Probably the easiest route is to modify the subprocess mixin's basic_control_target (or add a continuous-control version of it) so that, when the is_alive check fails, it restarts the worker (see the sketch below).
(b) The strongest method would be to override the ping functionality so it checks whether the worker is alive (this only works for a single level of worker); the cleanup method could also be made to emit a lot more errors.
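A minimal sketch of what option (a) could look like, assuming the worker is a multiprocessing.Process; the names (ContinuousSubprocessMixin, _make_worker, run_forever) are invented for illustration and don't reflect the actual subprocess mixin:

```python
import logging
import multiprocessing
import time

logger = logging.getLogger(__name__)


class ContinuousSubprocessMixin(object):
    """Keep a worker process running; restart it whenever it dies.

    Illustrative only -- not dragonfly's actual subprocess mixin.
    """

    def _worker_target(self):
        # placeholder for the real work loop a subclass would provide
        while True:
            time.sleep(1)

    def _make_worker(self):
        # a Process object can't be start()ed twice, so build a fresh one
        # each time we (re)start the worker
        return multiprocessing.Process(target=self._worker_target)

    def run_forever(self, poll_interval=5.0):
        worker = self._make_worker()
        worker.start()
        while True:
            time.sleep(poll_interval)
            if not worker.is_alive():
                # option (a): restart automatically instead of failing quietly
                logger.warning("worker exited with code %s; restarting",
                               worker.exitcode)
                worker = self._make_worker()
                worker.start()
```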

laroque commented 5 years ago

I'd have to go read up on the subprocess API, but it should be possible to check if the subprocess exited cleanly or with an error. Around subprocess_mixin.py#L70 (linked above), I'd think we could check/catch any dripline exceptions from the subprocess and raise them. This would, in principle, kill the service, ideally producing a p8_alert, and allow the process manager to restart the process.
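Without subprocess_mixin.py open here, a rough sketch of that check, assuming the worker is held as a multiprocessing.Process; SubprocessDied stands in for whatever dripline exception the real code would raise:

```python
class SubprocessDied(RuntimeError):
    """Stand-in for whatever dripline exception the real code would raise."""


def check_worker(worker):
    """Raise if the managed worker (a multiprocessing.Process) has exited."""
    if not worker.is_alive():
        # Process.exitcode is 0 for a clean exit, non-zero for an error,
        # and negative if the worker was killed by a signal
        raise SubprocessDied("worker {} exited with code {}".format(
            worker.name, worker.exitcode))
```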

I think it makes sense that in most cases, if the subprocess that was supposed to be running fails, we should exit the program and let it be restarted, rather than doing lots of in-process exception handling and restarting. (This is possibly different from the desired behavior when responding to a discrete dripline request, where it makes sense to catch dripline errors and pass them back in the reply while letting the service keep running.) Basically, I think it is easier to let the process die and be restarted in a known clean state (everything is stateless, so it should come up fine, right?) than to try to handle exceptions, clean things up, restore some reasonably clean state, and then go through a restart process.
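As a sketch of that fail-fast behaviour (names invented, not dragonfly's actual entry point), the top level would just let the exception escape and exit non-zero so the process manager handles the restart:

```python
import logging
import sys

logger = logging.getLogger(__name__)


def run_until_failure(service):
    """Run a blocking service; if it dies, exit non-zero and let the process
    manager (systemd, docker, ...) restart it from a clean state."""
    try:
        service.run()  # expected to raise if the monitored worker exits
    except Exception:
        logger.exception("service terminating after unhandled error")
        sys.exit(1)
```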

An added bonus is that if the failure is due to something external to the program, the process manager can see the repeated crashes, apply a backoff, and clearly show a failed state. If instead the service is stuck in an internal loop where it keeps trying and failing to restart the worker, we may still get slack alarms (that could happen in both designs), but to the process manager the service itself looks fine.