martijnkruiten opened 6 years ago
This can already be done with `SLURM_SC_OFFLINE_ARGS`, so I'm closing this pull request.
I closed it too soon. `node-mark-offline` is currently incompatible with `reboot ASAP`, because the latter expects fewer arguments:

```shell
SLURM_SC_OFFLINE_ARGS="update State=DRAIN"
exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS NodeName=$HOSTNAME Reason="$LEADER $NOTE"
```
Versus:

```shell
SLURM_SC_OFFLINE_ARGS="reboot ASAP"
exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS $HOSTNAME
```
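A minimal sketch of what a reboot-aware helper could do instead (the function name and defaults here are assumptions for illustration, not the actual NHC code). The key point is that `scontrol reboot` takes the node name bare, without the `NodeName=` prefix that `scontrol update` requires:

```shell
#!/bin/sh
# Hypothetical sketch, not the real NHC helper: compose the command
# line that "scontrol reboot" expects.
mark_reboot_cmd() {
    # $1 = node name, passed bare (no NodeName= prefix)
    printf '%s %s %s\n' \
        "${SLURM_SCONTROL:-scontrol}" \
        "${SLURM_SC_REBOOT_ARGS:-reboot ASAP}" \
        "$1"
}

mark_reboot_cmd node001   # prints: scontrol reboot ASAP node001
```

In a real helper this string would be passed to `exec` rather than printed, as in the snippets above.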
I'm working on an improved version with Slurm 18.08 support (`NextState` and `Reason` arguments), handling of existing notes (similar to `node-mark-offline`), and renamed variables (`SLURM_SC_OFFLINE_ARGS` becomes `SLURM_SC_REBOOT_ARGS`). This is done in a private repository, but eventually I will push it to this branch.
I've got an internal version that we use. I'm going to push it to this branch.
OK, so the difference between `node-mark-offline` and `node-mark-reboot` is only a few lines, so they can easily be merged into one helper. The main issue is that `scontrol reboot` expects the node names in a different format, so the helper would either have to inspect the value of `SLURM_SC_REBOOT_ARGS`, or there should be a separate environment variable that switches it to reboot mode. I guess the latter is a lot cleaner.
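The environment-variable approach could look something like the sketch below. `MARK_ACTION` is a made-up name for illustration only, not an existing NHC variable:

```shell
#!/bin/sh
# Hypothetical sketch of a merged offline/reboot helper.  MARK_ACTION
# is an assumed variable name; it only selects which scontrol
# subcommand to compose.
mark_node_cmd() {
    # $1 = node name.  The two subcommands want the node in different
    # forms: "update" needs NodeName=<node>, "reboot" takes the bare name.
    case "${MARK_ACTION:-offline}" in
        reboot) printf 'scontrol reboot ASAP %s\n' "$1" ;;
        *)      printf 'scontrol update State=DRAIN NodeName=%s\n' "$1" ;;
    esac
}

mark_node_cmd node001
( MARK_ACTION=reboot; mark_node_cmd node001 )
```

This keeps the node-name formatting logic in one place instead of forcing the helper to parse `SLURM_SC_REBOOT_ARGS`.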
The `node-mark-online` helper can cancel pending reboots (if the node is healthy again) or mark nodes online after a reboot. We've opted to reboot them with `NextState=DOWN` to avoid boot loops. We run NHC during the boot sequence, and at that point the nodes are either left in a drained state or resumed. If NHC is only run in the prolog and/or epilog, a different approach would be to set `NextState=RESUME` and do something inside the helper to avoid boot loops.
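For reference, with Slurm 18.08+ the next state and reason can be given directly on the reboot request; the node name and reason text below are placeholders, not values from the helper:

```shell
# Reboot node001 as soon as it is drained, and leave it DOWN afterwards
# so it is only resumed once NHC (or an operator) has verified it.
scontrol reboot ASAP nextstate=DOWN reason="NHC: reboot required" node001

# Resume it later, once it passes its boot-time health checks:
scontrol update NodeName=node001 State=RESUME
```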
For anyone looking to use this helper: it would work perfectly with something like this (I'm referring to the service file). That's because we've opted to let the node return in a drained state, so it will only be resumed if NHC is run during the boot process (or manually). That's by design: we don't want to trigger a boot loop during the prolog, and we also want to avoid scheduling a job on a node before we know for sure that it's in a good state after the reboot.
We use it like this in `nhc.conf`:

```shell
<target> || export OFFLINE_NODE="$HELPERDIR/node-mark-reboot"
<target> || <test that should trigger reboot>
<target> || export OFFLINE_NODE="$HELPERDIR/node-mark-offline"
<target> || <test that should trigger drain>
```
Alternatively, there is this pull request that tries to handle it differently.
I added a helper script to mark nodes for reboot. It's based on `node-mark-offline`, but executes `scontrol reboot ASAP <node>` instead. The helper can be used by setting `OFFLINE_NODE` to `$HELPERDIR/node-mark-reboot`. This is useful for checks whose failure requires a reboot. It's only compatible with Slurm.