mej / nhc (LBNL Node Health Check)

WIP: Node mark reboot helper #65

Open martijnkruiten opened 6 years ago

martijnkruiten commented 6 years ago

I added a helper script to mark nodes for reboot. It's based on node-mark-offline, but executes scontrol reboot ASAP <node> instead. The helper can be used by setting OFFLINE_NODE to $HELPERDIR/node-mark-reboot, which is useful for checks that require a reboot when they fail. It is compatible with Slurm only.
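
A minimal sketch of what such a helper might look like (a hypothetical outline modeled on node-mark-offline; the actual script in this PR handles more cases):

#!/bin/bash
# Hypothetical sketch of a node-mark-reboot helper, modeled on node-mark-offline.
# $1 is the node name; the remaining arguments (the note) are unused here,
# since scontrol reboot only gained a reason= argument in Slurm 18.08.
HOSTNAME="$1"
SLURM_SCONTROL="${SLURM_SCONTROL:-scontrol}"
# Ask Slurm to reboot the node as soon as it has drained of jobs.
exec $SLURM_SCONTROL reboot ASAP $HOSTNAME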

martijnkruiten commented 5 years ago

This can already be done with SLURM_SC_OFFLINE_ARGS, so I'm closing this pull request.

martijnkruiten commented 5 years ago

I closed it too soon. node-mark-offline is currently incompatible with reboot ASAP, because scontrol reboot expects fewer (and differently formatted) arguments:

SLURM_SC_OFFLINE_ARGS="update State=DRAIN"
exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS NodeName=$HOSTNAME Reason="$LEADER $NOTE"

Versus:

SLURM_SC_OFFLINE_ARGS="reboot ASAP"
exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS $HOSTNAME

martijnkruiten commented 5 years ago

I'm working on an improved version with Slurm 18.08 support (the NextState and Reason arguments), handling of existing notes (similar to node-mark-offline), and renamed variables (SLURM_SC_OFFLINE_ARGS becomes SLURM_SC_REBOOT_ARGS). This is done in a private repository, but I will eventually push it to this branch.
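
For reference, since Slurm 18.08 scontrol reboot accepts a next state and a reason, along these lines (the reason string here is just an example):

scontrol reboot ASAP nextstate=DOWN reason="NHC: health check failed" <nodelist>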

martijnkruiten commented 4 years ago

I've got an internal version that we use. I'm going to push it to this branch.

martijnkruiten commented 4 years ago

Ok, so the difference between node-mark-offline and node-mark-reboot is only a few lines, so they can easily be merged into one helper. The main issue is that scontrol reboot expects the node names in a different format, so the helper would either have to inspect the value of SLURM_SC_REBOOT_ARGS, or there should be a separate environment variable that switches it to reboot mode (sketched below). I guess the latter is a lot cleaner.
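
A rough sketch of that second option (hypothetical; the NHC_MARK_MODE variable name and the reason strings are assumptions, not part of this PR):

#!/bin/bash
# Hypothetical merged helper: drain or reboot depending on NHC_MARK_MODE.
HOSTNAME="$1"
shift
NOTE="$*"
SLURM_SCONTROL="${SLURM_SCONTROL:-scontrol}"
if [ "$NHC_MARK_MODE" = "reboot" ]; then
    # scontrol reboot takes a bare node list instead of NodeName=...
    exec $SLURM_SCONTROL reboot ASAP nextstate=DOWN reason="$NOTE" $HOSTNAME
else
    exec $SLURM_SCONTROL update State=DRAIN NodeName=$HOSTNAME Reason="$NOTE"
fi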

The node-mark-online helper can cancel pending reboots (if the node is healthy again) or mark nodes online after a reboot. We've opted to reboot nodes with NextState=DOWN to avoid boot loops: we run NHC during the boot sequence, and at that point the nodes are either left in a drained state or resumed. If NHC is run only in the prolog and/or epilog, a different approach would be to set NextState=RESUME and do something inside the helper to avoid boot loops.
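
Cancelling a pending reboot can also be done through scontrol; a node-mark-online along these lines could handle both cases (a fragment, assuming $HOSTNAME is set as in the other helpers):

# Hypothetical: cancel a pending reboot request for this node...
scontrol cancel_reboot $HOSTNAME
# ...and resume the node if NHC had drained it.
scontrol update State=RESUME NodeName=$HOSTNAME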

martijnkruiten commented 2 years ago

For anyone looking to use this helper: it would work perfectly with something like this (I'm referring to the service file). That's because we've opted to let the node return in a drained state, so it will only be resumed if NHC is run during the boot process (or manually). That's by design: we don't want to trigger a boot loop from the prolog, and we also want to avoid scheduling a job on a node before we know for sure that it's in a good state after the reboot.
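
The linked service file is not reproduced here, but as a hypothetical sketch, a boot-time NHC unit could look roughly like this (unit names, paths, and ordering are assumptions):

[Unit]
Description=Run NHC once at boot, before slurmd starts
After=network-online.target
Before=slurmd.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/nhc

[Install]
WantedBy=multi-user.target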

We use it like this in nhc.conf:

<target> || export OFFLINE_NODE="$HELPERDIR/node-mark-reboot"
<target> || <test that should trigger reboot>
<target> || export OFFLINE_NODE="$HELPERDIR/node-mark-offline"
<target> || <test that should trigger drain>
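
For illustration, filled in with the wildcard target and two checks from the standard NHC check set (the specific checks and their arguments are examples, not from this thread):

* || export OFFLINE_NODE="$HELPERDIR/node-mark-reboot"
* || check_fs_mount_rw -f /scratch
* || export OFFLINE_NODE="$HELPERDIR/node-mark-offline"
* || check_ps_service -u root -S sshd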

Alternatively, there is this pull request that tries to handle it differently.