Question: Custom Check, How to exit without any changes, i.e. leave node in current state?

mej / nhc

LBNL Node Health Check


Question: Custom Check, How to exit without any changes, i.e. leave node in current state? #139

Open flakrat opened 11 months ago

flakrat commented 11 months ago

Howdy, we have a custom check that retrieves a metric value from Prometheus using curl.

Edit: we are using Slurm as our resource manager.

The check works great. However, I need to add code to the check to prevent NHC from changing the state of the node (draining or un-draining it) if the curl command fails, for example:

- The Prometheus server is not responding
- The query doesn't return any metric (could happen if node_exporter died on the node)

Is there a way to return from the function where NHC would not make any changes to the node?

- return 0 indicates no failure and triggers an un-drain if the node is already drained, so I can't use that.
- return 1 (or any non-zero number) indicates failure and drains the node.

Thanks,

Mike Hanby UAB IT Research Computing

flakrat commented 11 months ago

I ran a few tests, and it appears that calling nhcmain_finish bypasses the code that drains/un-drains the node. However, I believe this would also skip any checks further down in nhc.conf.

I guess putting this particular check at the end of nhc.conf would mitigate this, but it's still hacky.

mej commented 10 months ago

So to make sure I understand... You want the check to fail if the successfully curl'd metric is above a certain threshold, and to pass if it can't obtain a valid metric to test against, but in that second case you don't want the node put back into service either?

At present, NHC doesn't really have a "soft fail" or a concept of a partially (un)healthy node, and that was really by design. You can, however, make changes to existing configuration values from within the code for your check. So if you wanted the check to pass but disallow "undraining" of the node, you can do something like ONLINE_NODE=: and then return 0 from your check. This would still allow subsequent checks to drain the node if they failed but keep an otherwise healthy node from being returned to service. Is that what you're wanting?
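Here is a minimal sketch of that approach, not the actual check from this thread. The names check_prom_metric, uab_fetch_metric, and PROM_URL are hypothetical, as is the example.com URL; the parsing assumes a Prometheus instant query that returns a single non-negative sample. The only NHC pieces it relies on are the ONLINE_NODE setting and the usual die-then-return-1 pattern for failing checks.

```shell
#!/bin/bash
# Sketch only: check_prom_metric, uab_fetch_metric and PROM_URL are
# hypothetical names, not part of NHC.

# Standalone fallback so this file can be exercised outside NHC.
# Inside NHC, die() already exists and this definition is skipped.
declare -F die >/dev/null || die() { echo "ERROR: $2" >&2; }

# Print the sample value for an instant query; return non-zero if the
# server is unreachable or the response contains no sample.
function uab_fetch_metric() {
    local base="${PROM_URL:-http://prometheus.example.com:9090}"
    local body value
    body=$(curl --silent --fail --max-time 5 --get \
        --data-urlencode "query=$1" "$base/api/v1/query") || return 1
    # Assumes one result of the form "value":[<timestamp>,"<value>"]
    value=$(sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"].*/\1/p' <<<"$body")
    [[ -n "$value" ]] || return 1
    echo "$value"
}

# Usage: check_prom_metric <promql-query> <integer-threshold>
function check_prom_metric() {
    local query="$1" max="$2" value

    # No usable metric: pass the check, but turn the "online" helper
    # into a no-op so this run cannot un-drain the node.
    if ! value=$(uab_fetch_metric "$query") ||
       [[ ! "$value" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
        ONLINE_NODE=:
        return 0
    fi

    # Compare the integer part only; bash arithmetic has no floats.
    if (( 10#${value%.*} > max )); then
        die 1 "$FUNCNAME: metric value $value exceeds threshold $max"
        return 1
    fi
    return 0
}
```

In nhc.conf this would be called like any other check, for example `* || check_prom_metric 'some_metric{instance="node01:9100"}' 90`, where the query and threshold are placeholders. Since ONLINE_NODE is only changed for the current run, the node can be un-drained again on a later run once the metric is reachable and every check passes.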

Feel free to share the code in question if that might help clarify what you're shooting for here! 😀

flakrat commented 10 months ago

> So if you wanted the check to pass but disallow "undraining" of the node, you can do something like ONLINE_NODE=: and then return 0 from your check. This would still allow subsequent checks to drain the node if they failed but keep an otherwise healthy node from being returned to service.

This is what I'm after, thanks.

Here's the code: https://gitlab.rc.uab.edu/rc/rc-nhc/-/blob/main/uabrc_hw.nhc