stackhpc / ansible-role-openhpc

Ansible role for OpenHPC
Apache License 2.0

Add wait_for slurmdbd port #129

Closed m-bull closed 2 years ago

m-bull commented 2 years ago

After restarting slurmdbd, wait until the service is accessible on its configured port before restarting any other services that depend on it.

This fixes an issue where slurmctld is started before slurmdbd is responding on its port, causing systemctl restart slurmctld to fail and write the following messages to syslog:

Mar  3 17:02:58 matt-slurm-control-0 slurmctld[63748]: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Mar  3 17:02:58 matt-slurm-control-0 slurmctld[63748]: error: Sending PersistInit msg: Connection refused
Mar  3 17:02:58 matt-slurm-control-0 slurmctld[63748]: fatal: You are running with a database but for some reason we have no TRES from it.  This should only happen if the database is down and you don't have any state files.
Mar  3 17:02:58 matt-slurm-control-0 systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Mar  3 17:02:58 matt-slurm-control-0 systemd[1]: slurmctld.service: Failed with result 'exit-code'.

This wait_for approach is already taken when restarting the slurmctld daemon.
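
For readers unfamiliar with the pattern, a minimal sketch of the ordering described above could look like the following. It is an illustration only, not the role's actual tasks: the variable names example_slurmdbd_host and example_slurmdbd_port are hypothetical, the 60-second timeout is an arbitrary choice, and 6819 is assumed as the port because it is Slurm's default DbdPort.

```yaml
# Illustrative sketch only; task names, variables and timeout are assumptions.
- name: Restart slurmdbd
  ansible.builtin.service:
    name: slurmdbd
    state: restarted

# Block until slurmdbd accepts TCP connections, so that slurmctld
# does not start against a database daemon that is not yet listening.
- name: Wait for slurmdbd port
  ansible.builtin.wait_for:
    host: "{{ example_slurmdbd_host | default('localhost') }}"  # hypothetical variable
    port: "{{ example_slurmdbd_port | default(6819) }}"         # hypothetical variable
    state: started
    timeout: 60

- name: Restart slurmctld
  ansible.builtin.service:
    name: slurmctld
    state: restarted
```

If the port does not open within the timeout, wait_for fails the play at that step instead of letting slurmctld exit with the "Connection refused" error shown in the log.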

m-bull commented 2 years ago

Tested the new changes using Azimuth: the problem is still fixed and appliances still provision!