mej / nhc

LBNL Node Health Check
Other
213 stars 78 forks source link

nhc: Refactor watchdog timer code for reuse #132

Closed mej closed 1 year ago

mej commented 1 year ago

From the log entries for the included commits (1dd668e57eaabd621153b08771c224ada6d806c0 and b171d831f492a08b06fa3fc52ce1dab578d5b4d2):

For some reason (probably simplicity), the check_cmd_status() check function was using a different method of spawning a subprocess with a timout than check_cmd_status() was. The nhc_cmd_with_timeout() function was written specifically to facilitate consistency in coding that exact functionality, but only check_cmd_output() was using it. The check_cmd_status() check function was launching the subcommand directly and trying to use nhcmain_watchdog_timer() to create its watchdog timer process.

As observed in https://github.com/mej/nhc/issues/104, check_cmd_status() (but not check_cmd_output()) was leaving behind watchdog timer and sleep processes that should have been terminated when the subcommand exited. Unfortunately, nhcmain_watchdog_timer() was not written with this use case in mind, nor was kill_watchdog() expecting to have to clean up multiple child processes.

This will be addressed in 2 ways. For the "fix-only" 1.4.4 tree, I've switched check_cmd_status() to using nhc_cmd_with_timeout() since it uses a different mechanism and has not displayed this behavior. The longer term fix will be to refactor the watchdog code in nhc itself and use that code whenever launching subcommands is needed.

This should address https://github.com/mej/nhc/issues/104 for the 1.4.4 branch but will be applied to both for now, once tested and verified.


As referenced in the above commits, the longer-term fix for the 1.5+ branch is a refactoring of all the watchdog timer code in nhc so that multiple distinct timers can be managed simultaneously, including their termination in case of successful subprocess/program exit. Lack of proper cleanup was ultimately the key cause of https://github.com/mej/nhc/issues/104's leaked shell+sleep processes.

NOTE: The nhc script itself does not keep track of all the PIDs for all the timers it has spawned off, only the main one (for the top-level nhc process). Any other timer PIDs must be tracked by whatever spawned them. In particular, nhc_cmd_with_timeout() tracks both the task and the timer PIDs and ensures that both processes have exited before it returns.

mej commented 1 year ago

Good to go in.