Open flakrat opened 11 months ago
I ran a few tests and it appears that calling `nhcmain_finish` works to bypass the code that drains/un-drains the node; however, I believe this would also bypass any checks further down the line.
I guess putting this particular check at the end of `nhc.conf` would mitigate this, but it's still hacky.
So to make sure I understand... You want the check to fail if the correctly `curl`'d metric is above a certain threshold, but you want it to pass if it can't obtain a valid metric to test against, though in this case you don't want the node put back into service either?
At present, NHC doesn't really have a "soft fail" or a concept of a partially (un)healthy node, and that was really by design. You can, however, make changes to existing configuration values from within the code for your check. So if you wanted the check to pass but disallow "undraining" of the node, you can do something like `ONLINE_NODE=:` and then `return 0` from your check. This would still allow subsequent checks to drain the node if they failed but keep an otherwise healthy node from being returned to service. Is that what you're wanting?
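As a rough illustration of that suggestion, here is a minimal sketch of such a check. The names `check_prom_metric` and `fetch_metric` are hypothetical, and the sketch assumes the helper prints a bare integer; parsing a real Prometheus response is left out. The fallback `die` is only there so the snippet runs outside NHC, where `die` is not defined.

```shell
#!/bin/bash
# Sketch only: check_prom_metric and fetch_metric are hypothetical names.

# Fallback so the sketch can run outside NHC; inside NHC, die already exists.
if ! declare -F die >/dev/null; then
    die() { local rc=$1; shift; echo "ERROR: $*" >&2; return "$rc"; }
fi

# Hypothetical helper: print the metric value, or return nonzero on failure.
# Assumes the URL returns a bare integer.
fetch_metric() {
    curl --silent --fail --max-time 5 "$1"
}

check_prom_metric() {
    local url=$1 max=$2 value

    if ! value=$(fetch_metric "$url") || [[ ! $value =~ ^[0-9]+$ ]]; then
        # No valid metric: let the check pass, but replace the "online"
        # helper with a no-op so the node is not returned to service.
        ONLINE_NODE=:
        return 0
    fi

    if (( value > max )); then
        die 1 "$FUNCNAME: metric value $value exceeds threshold $max"
        return 1
    fi

    return 0
}
```

In `nhc.conf` a check like this might be invoked as `* || check_prom_metric "http://prometheus.example/metric" 90`, where both the URL and the threshold are placeholders.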
Feel free to share the code in question if that might help clarify what you're shooting for here! 😀
> So if you wanted the check to pass but disallow "undraining" of the node, you can do something like `ONLINE_NODE=:` and then `return 0` from your check. This would still allow subsequent checks to drain the node if they failed but keep an otherwise healthy node from being returned to service.

This is what I'm after, thanks.
Here's the code: https://gitlab.rc.uab.edu/rc/rc-nhc/-/blob/main/uabrc_hw.nhc
Howdy, we have a custom check that retrieves a metric value from Prometheus using `curl`.
Edit: we are using Slurm as our resource manager.
The check works great, however I need to add code to the check to prevent NHC from changing the state of the node (drained, un-drained) if the curl command fails, examples:
Is there a way to return from the function where NHC would not make any changes to the node?
- `return 0` indicates no failure and triggers an un-drain if the node is already drained, so I can't use that.
- `return 1` (or any other nonzero number) indicates failure and drains the node.

Thanks,
Mike Hanby
UAB IT Research Computing