stackhpc / ansible-role-openhpc

Ansible role for OpenHPC
Apache License 2.0

Fix issues with slurm daemon startup when adding nodes #85

Closed · sjpb closed this 3 years ago

sjpb commented 3 years ago

Fixes https://github.com/stackhpc/ansible-role-openhpc/issues/84 - please read the comments there describing the issue!

The fix here is to restart the Slurm daemons whenever slurm.conf changes.

This is a change in behaviour: changing slurm.conf for any reason will now restart all the daemons.
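For illustration, a minimal sketch of what that handler-based approach could look like; the task/handler names here are assumptions, not necessarily the role's actual ones, though the `openhpc_enable` flags follow the role's existing pattern:

```yaml
# Sketch only: task and handler names are illustrative assumptions.
# tasks/main.yml
- name: Template out slurm.conf
  ansible.builtin.template:
    src: slurm.conf.j2
    dest: /etc/slurm/slurm.conf
  notify:
    - Restart slurmctld
    - Restart slurmd

# handlers/main.yml
- name: Restart slurmctld
  ansible.builtin.service:
    name: slurmctld
    state: restarted
  when: openhpc_enable.control | default(false)  # only the control node runs slurmctld

- name: Restart slurmd
  ansible.builtin.service:
    name: slurmd
    state: restarted
  when: openhpc_enable.batch | default(false)  # only compute nodes run slurmd
```

Splitting the restart into two handlers gated per host avoids trying to restart a service that isn't installed on that node.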

NB: the CI check for this (check10) fails on idempotence. That is not a "real" failure: the converge phase of the test has to modify the cluster in order to test adding a node. This needs fixing, but it's complex.

sjpb commented 3 years ago

OK, I don't think this is good enough. It still doesn't restart the daemons on ALL nodes, which is what the Slurm docs state is required when adding/removing nodes. Currently it restarts slurmctld and starts slurmd on the new node only. It appears to work, but I think it's probably fragile.
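For context, a sketch of what the documented requirement implies as a standalone play; the group names (`openhpc_control`, `openhpc_compute`) are assumptions for illustration. Because Ansible completes each task across all hosts before starting the next, slurmctld is restarted before any slurmd:

```yaml
# Sketch: restart every Slurm daemon cluster-wide, controller first.
# Group names are illustrative assumptions, not the role's actual ones.
- hosts: openhpc_control:openhpc_compute
  become: true
  tasks:
    - name: Restart slurmctld so the controller reads the new node list
      ansible.builtin.service:
        name: slurmctld
        state: restarted
      when: inventory_hostname in groups['openhpc_control']

    - name: Restart slurmd on every compute node
      ansible.builtin.service:
        name: slurmd
        state: restarted
      when: inventory_hostname in groups['openhpc_compute']
```

Note this only works if every compute node is actually in the play, which is exactly the limitation discussed below.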

sjpb commented 3 years ago

NB: it's probably impossible to guarantee restarting all the slurmds, as they might not be in the play!

sjpb commented 3 years ago

@JohnGarbutt, @jovial what do you think please? It works, but a) it is pretty complex in terms of ordering etc., and b) it requires the play adding the nodes to run on the entire cluster. I'm kind of tempted (as JG suggested) to rip out all the slurmctld/slurmd handlers, leaving just the munge/slurmdbd ones, and potentially add some plays for:

sjpb commented 3 years ago

Note that this PR means that: