quattor / ncm-ncd

Node Configuration Dispatcher Framework for Components
www.quattor.org

ncm-ncd: timeout on stuck NCM components #114

Open msmark opened 7 years ago

msmark commented 7 years ago

As discussed on the mailing list, ncm-ncd should have a configurable timeout so that an NCM component that hangs indefinitely, for whatever reason, cannot prevent ncm-ncd from continuing with its job and completing with an exit status. ncm-ncd should behave in one of three ways:

1) Current behaviour, i.e. a timeout of zero means never time out.
2) Alert if a component times out, but continue to wait.
3) Alert if a component times out, kill and clean up that component, and continue with the next components. Any components that depend on the killed one cannot be run, of course, and should also be reported as errors with a note that the parent component failed.
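
To make the three behaviours concrete, here is a rough sketch in Python. It is not ncm-ncd code (ncm-ncd is written in Perl), and every name in it is hypothetical: the policy names `never`, `warn` and `kill`, the functions `run_component` and `run_all`, and the result strings are all made up for illustration. It also assumes each component can be run as a separate child process and that the components arrive already sorted in dependency order.

```python
import subprocess
import sys

# Hypothetical policy names for the three proposed behaviours;
# these are NOT existing ncm-ncd options.
NEVER, WARN, KILL = "never", "warn", "kill"


def stderr_alert(message):
    """Default alert hook: write the message to stderr."""
    print(message, file=sys.stderr)


def run_component(cmd, timeout=0, policy=NEVER, alert=stderr_alert):
    """Run one component command; return 'ok', 'failed' or 'timeout'."""
    proc = subprocess.Popen(cmd)
    if timeout <= 0 or policy == NEVER:
        # Behaviour 1: a timeout of zero means wait forever.
        return "ok" if proc.wait() == 0 else "failed"
    try:
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        alert(f"component timed out after {timeout}s: {cmd}")
        if policy == WARN:
            # Behaviour 2: alert, but keep waiting for the component.
            rc = proc.wait()
        else:
            # Behaviour 3: alert, kill the component and reap it.
            proc.kill()
            proc.wait()
            return "timeout"
    return "ok" if rc == 0 else "failed"


def run_all(components, deps, timeout=0, policy=NEVER, alert=stderr_alert):
    """Run components in order, skipping any whose parents did not succeed.

    components: dict of name -> argv list, assumed to be in dependency order.
    deps: dict of name -> list of parent names that must finish 'ok' first.
    """
    results = {}
    for name, cmd in components.items():
        # A parent that failed, timed out, was skipped or never ran blocks us.
        blocked = [d for d in deps.get(name, []) if results.get(d) != "ok"]
        if blocked:
            note = f"skipped (parent failed: {', '.join(blocked)})"
            alert(f"{name}: {note}")
            results[name] = note
            continue
        results[name] = run_component(cmd, timeout, policy, alert)
    return results
```

With the `kill` policy, a hung component is reported as `timeout`, its dependants are reported as skipped with the name of the failed parent, and unrelated components still run, so the whole run can finish and return an exit status.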

Without a timeout, if any component hangs indefinitely, so does ncm-ncd, and subsequent runs of ncm-ncd cannot take place because of the lock file. This left one affected system in a state where nothing was updated for a month and nobody noticed.