mej / nhc

LBNL Node Health Check

NHC Helpers vs. Unknown Slurm States #126

Open mej opened 1 year ago

mej commented 1 year ago

https://github.com/mej/nhc/blob/375e7e028425fcf9da2653f707c9cb7af6c4a583/helpers/node-mark-offline#L88
https://github.com/mej/nhc/blob/375e7e028425fcf9da2653f707c9cb7af6c4a583/helpers/node-mark-online#L81

At present, the handling of unknown node states in Slurm is effectively undefined, and it shouldn't be: the helpers just echo a message and continue with whatever comes next. The user should be able to control whether NHC treats unknown states as errors or ignores them.

What to do? Add either NHC_IGNORE_UNKNOWN_STATE or NHC_FAIL_UNKNOWN_STATE as a new config variable (preferably one or the other, not both) to allow the helpers to online/offline a node even if the node's state isn't recognized as valid.
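To make the proposal concrete, here is a minimal sketch of how a helper might honor such a variable. It is not the current helper code: the function name `check_node_state` is hypothetical, the list of recognized states is illustrative rather than complete, and the semantics of `NHC_FAIL_UNKNOWN_STATE` (1 means treat an unknown state as an error, anything else means proceed anyway) are an assumption, since the variable does not exist yet.

```shell
# Hypothetical sketch only; not taken from node-mark-offline or node-mark-online.
# Assumed semantics: NHC_FAIL_UNKNOWN_STATE=1 makes an unrecognized state an
# error; any other value (or unset) lets the helper proceed with its action.

check_node_state() {
    # $1 is the node state string as reported by Slurm (e.g. "idle", "drain*").
    state="$1"
    case "$state" in
        # Illustrative subset of state prefixes; the real helpers may differ.
        alloc*|comp*|down*|drain*|drng*|idle*|mix*)
            return 0
            ;;
        *)
            if [ "${NHC_FAIL_UNKNOWN_STATE:-0}" = "1" ]; then
                echo "Unknown node state \"$state\"; treating as an error." >&2
                return 1
            fi
            echo "Unknown node state \"$state\"; proceeding anyway." >&2
            return 0
            ;;
    esac
}
```

A helper could then call `check_node_state "$STATE" || exit 1` before attempting to change the node's state, so that the outcome for an unrecognized state is decided by the setting rather than left implicit.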

Even as a solid, production-quality, commercially supported product, Slurm is still innovating at a fairly rapid pace. Since that innovation frequently involves adding new node states, I think being more explicit and giving the user control over this behavior would improve usability.

basvandervlies commented 1 year ago

@mej I agree that we should give the user this control via environment variables, since Slurm innovates rapidly and new states keep being added.

mej commented 1 year ago

Thanks Bas! :)