stackhpc / ansible-role-openhpc

Ansible role for OpenHPC
Apache License 2.0

Configless mode failed to start slurmd #72

Closed: sjpb closed this issue 3 years ago

sjpb commented 3 years ago

On 00d83d7053ca27a1892aa72f2f801a3deadd11c6, I saw slurmd fail to start after instance creation, with:

[centos@testohpc-compute-2 ~]$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Tue 2020-11-24 09:56:20 UTC; 24s ago
  Process: 26914 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=killed, signal=TERM)

Nov 24 09:54:50 testohpc-compute-2.novalocal systemd[1]: Starting Slurm node daemon...
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: slurmd.service: Start operation timed out. Terminating.
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: slurmd.service: Failed with result 'timeout'.
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: Failed to start Slurm node daemon.
[centos@testohpc-compute-2 ~]$ journalctl -xe
Nov 24 09:54:52 testohpc-compute-2.novalocal systemd[1]: Started man-db-cache-update.service.
-- Subject: Unit man-db-cache-update.service has finished start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit man-db-cache-update.service has finished starting up.
-- 
-- The start-up result is done.
Nov 24 09:54:59 testohpc-compute-2.novalocal slurmd[27015]: error: _fetch_child: failed to fetch remote configs
Nov 24 09:56:16 testohpc-compute-2.novalocal systemd[4426]: Starting Mark boot as successful...
-- Subject: Unit UNIT has begun start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit UNIT has begun starting up.
Nov 24 09:56:16 testohpc-compute-2.novalocal systemd[4426]: Started Mark boot as successful.
-- Subject: Unit UNIT has finished start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit UNIT has finished starting up.
-- 
-- The start-up result is done.
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: slurmd.service: Start operation timed out. Terminating.
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: slurmd.service: Failed with result 'timeout'.
Nov 24 09:56:20 testohpc-compute-2.novalocal systemd[1]: Failed to start Slurm node daemon.

However, issuing sudo systemctl start slurmd immediately after this succeeded. Perhaps some issue with name resolution, or something similar, shortly after node startup?
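If it really is an ordering/resolution problem, one way to test that would be to gate the slurmd start on the slurmctld host being reachable. A diagnostic sketch only, not part of the role - the host name is taken from the play output above and 6817 is the default slurmctld port:

# Diagnostic sketch: wait until the slurmctld host answers on its default
# port before starting slurmd in configless mode.
- name: Wait for the slurmctld host to be reachable
  ansible.builtin.wait_for:
    host: testohpc-login-0   # assumed control node, from the output above
    port: 6817               # default slurmctld port
    timeout: 120

- name: Start slurmd once slurmctld is reachable
  ansible.builtin.systemd:
    name: slurmd
    state: started
    enabled: true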

sjpb commented 3 years ago

This is definitely repeatable as of ce57ea1. Things which DON'T fix it:

Although slurmctld says it has started, there's actually a munge failure - with an odd date too (the 1969 timestamp is just an unset epoch value from the failed decode):

Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: error: slurm_unpack_received_msg: Protocol authentication error
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: error: slurm_receive_msg [10.60.253.64:49676]: Unspecified error
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: error: If munged is up, restart with --num-threads=10
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: error: Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: ENCODED: Wed Dec 31 23:59:59 1969
Dec 11 16:54:22 testohpc-login-0.novalocal slurmctld[16117]: DECODED: Wed Dec 31 23:59:59 1969

But if I restart slurmctld this goes away, and I can then restart slurmd OK.
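The missing /var/run/munge/munge.socket.2 suggests munged was not (yet) up when slurmctld started, which would also explain why a manual slurmctld restart clears it. A minimal ordering sketch, assuming the stock munge and slurmctld unit names and not the role's actual tasks:

# Sketch only: make sure munged is running before slurmctld is (re)started.
- name: Ensure munge is running and enabled
  ansible.builtin.systemd:
    name: munge
    state: started
    enabled: true

- name: Restart slurmctld once munge is confirmed up
  ansible.builtin.systemd:
    name: slurmctld
    state: restarted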

sjpb commented 3 years ago

OK, here's the problem:

TASK [ansible-role-openhpc : Write Munge key] ***********************************************************************************************************************************************************************************************
ok: [testohpc-login-0]
changed: [testohpc-compute-0]
changed: [testohpc-compute-1]

TASK [ansible-role-openhpc : Template slurmdbd.conf] ****************************************************************************************************************************************************************************************
skipping: [testohpc-login-0]
skipping: [testohpc-compute-0]
skipping: [testohpc-compute-1]

TASK [ansible-role-openhpc : Apply customised SLURM configuration] **************************************************************************************************************************************************************************
skipping: [testohpc-compute-0]
skipping: [testohpc-compute-1]
changed: [testohpc-login-0]

TASK [ansible-role-openhpc : Set slurmctld location for configless operation] ***************************************************************************************************************************************************************
skipping: [testohpc-login-0]
changed: [testohpc-compute-0]
changed: [testohpc-compute-1]

RUNNING HANDLER [ansible-role-openhpc : Restart Munge service] ******************************************************************************************************************************************************************************
changed: [testohpc-compute-0]
changed: [testohpc-compute-1]

RUNNING HANDLER [ansible-role-openhpc : Reload SLURM service] *******************************************************************************************************************************************************************************
changed: [testohpc-login-0]
fatal: [testohpc-compute-0]: FAILED! => {
    "changed": false
}

So we can see that:

sjpb commented 3 years ago

@jovial I'd be interested to discuss why the Restart Munge service handler isn't running on the control node here.
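For context, Ansible notifies handlers per host, and a handler only fires on hosts where the notifying task reported changed; in the run above, Write Munge key was ok on testohpc-login-0, so Restart Munge service is never triggered there. A minimal sketch of that mechanism, reusing the host, task and handler names from the output above (not the role's actual tasks; munge.key is assumed to sit next to the playbook):

# Handlers fire per host, and only on hosts where the notifying task
# reported "changed". The login node's key is already in place ("ok"),
# so the munge restart only runs on the compute nodes.
- hosts: testohpc-login-0,testohpc-compute-0,testohpc-compute-1
  become: true
  tasks:
    - name: Write Munge key
      ansible.builtin.copy:
        src: munge.key          # assumed local key file for this sketch
        dest: /etc/munge/munge.key
        owner: munge
        group: munge
        mode: "0400"
      notify: Restart Munge service
  handlers:
    - name: Restart Munge service
      ansible.builtin.systemd:
        name: munge
        state: restarted

One possible fix along these lines would be to force a munge restart on the slurmctld host whenever the key changes on any host, before slurmctld is reloaded.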