stackhpc / ansible-role-openhpc

Ansible role for OpenHPC
Apache License 2.0

Provide error messages on failure to start slurm daemons #95

Open sjpb opened 3 years ago

sjpb commented 3 years ago

If this fails, then journalctl or systemctl status may well have useful info. For example, if you specify two partitions which share nodes (which is legal in Slurm, but isn't handled by our current templating) then:

but actually the control node shows:

$ sudo journalctl -xe
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-2 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-3 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: fatal: Duplicated NodeHostName nrel-hpc-0 in config file

and

$ sudo systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2021-02-24 09:21:38 UTC; 5min ago
  Process: 26178 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 26180 (code=exited, status=1/FAILURE)

Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: layouts: no layout to initialize
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-2 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-hpc-3 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-0 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: error: Duplicated NodeHostName nrel-express-1 in the config file
Feb 24 09:21:38 nrel-control.novalocal slurmctld[26180]: fatal: Duplicated NodeHostName nrel-hpc-0 in config file
Feb 24 09:21:38 nrel-control.novalocal systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 24 09:21:38 nrel-control.novalocal systemd[1]: slurmctld.service: Failed with result 'exit-code'.
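
One possible way to surface these messages from the role is an Ansible block with a rescue section that collects the journal for the unit and fails with it. This is only a sketch, not the role's current implementation: the task names and the `slurmctld_journal` variable are hypothetical, and the 20-line limit is an arbitrary choice. The `is-active` check is included because the status output above shows the ExecStart process exiting with 0/SUCCESS before the main PID fails, so the start task on its own may not report an error.

```yaml
# Hypothetical tasks; a sketch, not the role's actual implementation.
- name: Start slurmctld and report journal output on failure
  become: true
  block:
    - name: Start slurmctld
      ansible.builtin.service:
        name: slurmctld
        state: started

    # Exits non-zero if the unit is not active, which triggers the rescue section.
    - name: Check slurmctld is still running
      ansible.builtin.command: systemctl is-active slurmctld
      changed_when: false

  rescue:
    - name: Collect recent slurmctld journal entries
      ansible.builtin.command: journalctl --unit slurmctld --no-pager --lines 20
      register: slurmctld_journal  # hypothetical variable name
      changed_when: false

    - name: Fail with the daemon's own error messages
      ansible.builtin.fail:
        msg: "{{ slurmctld_journal.stdout_lines }}"
```

With something like this, the "Duplicated NodeHostName" lines would appear in the Ansible output rather than only on the control node. The check may still race with a daemon that dies slightly later, so a short wait before it might be needed.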