stackhpc / ansible-role-openhpc

Ansible role for OpenHPC
Apache License 2.0

Ensure slurmctld actually up before completing restart handler #105

Closed sjpb closed 3 years ago

sjpb commented 3 years ago

At NREL we saw Slurm fall over on startup. The hypothesis was that slurmctld startup was taking a while (due to slurm.conf taking ~4s to read) but the unit returns immediately. Hence the slurmds started, couldn't contact slurmctld, and went down.

The fix checks that the slurmctld port is open before exiting the restart handler. The 10s delay before the check is to wait for the old slurmctld to go down. Ran OK at NREL.
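
For illustration only, the wait logic described above can be sketched in Python. This is not the role's actual handler code (the role is written as Ansible tasks); the function name `wait_for_port` and its parameters `delay`, `timeout` and `poll_interval` are hypothetical, chosen to mirror the "delay, then check the port is open" behaviour.

```python
import socket
import time


def wait_for_port(host, port, delay=10.0, timeout=300.0, poll_interval=1.0):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection succeeds before `timeout` seconds
    have elapsed (measured after the initial delay).
    """
    # Initial delay: give the old daemon time to go down, so the check
    # below does not succeed against the still-open port of the old process.
    time.sleep(delay)

    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=poll_interval):
                return True
        except OSError:
            # Connection refused or timed out: the daemon is not up yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval)
```

A restart step would call something like `wait_for_port("localhost", 6817)` after restarting slurmctld and only then let the slurmds start. The port 6817 is Slurm's default `SlurmctldPort`; a site that overrides it in slurm.conf would need to pass its own value.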