sjpb closed 3 years ago
This is definitely repeatable as of ce57ea1. Things which DON'T fix it:
systemctl daemon-reload before the reload/restart in the Reload SLURM service handler
Although slurmctld reports that it has started, there is actually a munge failure, with an odd date too:
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: error: slurm_unpack_received_msg: Protocol authentication error
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: error: slurm_receive_msg [10.60.253.64:49676]: Unspecified error
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: error: If munged is up, restart with --num-threads=10
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: error: Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: ENCODED: Wed Dec 31 23:59:59 1969
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: DECODED: Wed Dec 31 23:59:59 1969
But if I restart slurmctld this goes away, and then I can restart slurmd OK.
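The log above points at a missing munge socket rather than a slurmctld fault, so one way to confirm this on the node is to check for the socket before restarting slurmctld. This is a minimal sketch, not part of the role: the socket path is copied from the error message, and the helper name `munge_socket_ready` is hypothetical.

```python
import os
import stat

# Socket path taken from the slurmctld error message above.
MUNGE_SOCKET = "/var/run/munge/munge.socket.2"


def munge_socket_ready(path: str = MUNGE_SOCKET) -> bool:
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Covers "No such file or directory" as seen in the log.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    if munge_socket_ready():
        print("munge socket present; slurmctld should be able to authenticate")
    else:
        print("munge socket missing; start munge before restarting slurmctld")
```

The check only shows that the socket file exists; it does not prove that munged is answering requests on it.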
Ok here's the problem:
TASK [ansible-role-openhpc : Write Munge key] ***********************************************************************************************************************************************************************************************
ok: [testohpc-login-0]
changed: [testohpc-compute-0]
changed: [testohpc-compute-1]
TASK [ansible-role-openhpc : Template slurmdbd.conf] ****************************************************************************************************************************************************************************************
skipping: [testohpc-login-0]
skipping: [testohpc-compute-0]
skipping: [testohpc-compute-1]
TASK [ansible-role-openhpc : Apply customised SLURM configuration] **************************************************************************************************************************************************************************
skipping: [testohpc-compute-0]
skipping: [testohpc-compute-1]
changed: [testohpc-login-0]
TASK [ansible-role-openhpc : Set slurmctld location for configless operation] ***************************************************************************************************************************************************************
skipping: [testohpc-login-0]
changed: [testohpc-compute-0]
changed: [testohpc-compute-1]
RUNNING HANDLER [ansible-role-openhpc : Restart Munge service] ******************************************************************************************************************************************************************************
changed: [testohpc-compute-0]
changed: [testohpc-compute-1]
RUNNING HANDLER [ansible-role-openhpc : Reload SLURM service] *******************************************************************************************************************************************************************************
changed: [testohpc-login-0]
fatal: [testohpc-compute-0]: FAILED! => {
"changed": false
}
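So we can see that the Write Munge key task reported a change only on the two compute nodes (it was ok on testohpc-login-0), the Restart Munge service handler ran only on those compute nodes, and the Reload SLURM service handler then failed on testohpc-compute-0.
@jovial I'd be interested to discuss why the Restart Munge service handler isn't running on the control node here.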
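To make the per-host pattern in the play output easier to see, the output can be grouped by task or handler. This is a rough sketch that assumes the output has been captured as plain text in the default format shown above; the function name `hosts_by_section` and the shortened `SAMPLE` string are illustrative only.

```python
import re

# Matches "TASK [name] ****" and "RUNNING HANDLER [name] ****" header lines.
HEADER = re.compile(r"^(?:TASK|RUNNING HANDLER) \[(?P<name>.+?)\]")
# Matches per-host result lines such as "changed: [host]" or "fatal: [host]: ...".
RESULT = re.compile(r"^(?P<status>ok|changed|skipping|fatal): \[(?P<host>[^\]]+)\]")


def hosts_by_section(output: str) -> dict[str, dict[str, str]]:
    """Map each task/handler name to a {host: status} dictionary."""
    sections: dict[str, dict[str, str]] = {}
    current = None
    for line in output.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            current = sections.setdefault(header["name"], {})
            continue
        result = RESULT.match(line)
        if result and current is not None:
            current[result["host"]] = result["status"]
    return sections


# Excerpt of the output above, with the asterisk padding shortened.
SAMPLE = """\
TASK [ansible-role-openhpc : Write Munge key] ****
ok: [testohpc-login-0]
changed: [testohpc-compute-0]
changed: [testohpc-compute-1]
RUNNING HANDLER [ansible-role-openhpc : Restart Munge service] ****
changed: [testohpc-compute-0]
changed: [testohpc-compute-1]
RUNNING HANDLER [ansible-role-openhpc : Reload SLURM service] ****
changed: [testohpc-login-0]
fatal: [testohpc-compute-0]: FAILED! => {
"""

sections = hosts_by_section(SAMPLE)
restarted = sections["ansible-role-openhpc : Restart Munge service"]
print("testohpc-login-0" in restarted)  # prints False
```

The login node never appears under the Restart Munge service handler, which matches the question raised above.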
On 00d83d7053ca27a1892aa72f2f801a3deadd11c6, I saw slurmd startup after instance creation fail with:
However, issuing
sudo systemctl start slurmd
immediately afterwards succeeded. Is there some issue with name resolution or similar just after node startup?
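If the failure really is transient, as speculated above, one possible workaround is to retry the start a few times with a short delay. This is only a sketch under that assumption and does not address the root cause; the helpers `retry` and `start_slurmd` and their parameters are hypothetical and not part of the role.

```python
import subprocess
import time
from typing import Callable


def retry(action: Callable[[], bool], attempts: int = 5, delay: float = 2.0) -> bool:
    """Call `action` until it returns True or `attempts` calls have been made."""
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            # Wait before the next try, e.g. for name resolution to settle.
            time.sleep(delay)
    return False


def start_slurmd() -> bool:
    """Run the command from the comment above; True if it exits with status 0."""
    result = subprocess.run(["sudo", "systemctl", "start", "slurmd"], check=False)
    return result.returncode == 0


# Example use on a compute node (not executed here):
#     retry(start_slurmd, attempts=5, delay=2.0)
```

Keeping the retry logic separate from the command makes it easy to exercise with a stand-in action before pointing it at systemctl.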