stackhpc / ansible-role-openhpc

Ansible role for OpenHPC
Apache License 2.0

Adding nodes to running cluster fails in configless mode. #84

Closed sjpb closed 3 years ago

sjpb commented 3 years ago

In this mode, slurmds have to contact slurmctld for the config, which means they need to be defined in the running config, not just the on-disk slurm.conf.

Currently:

  1. The changed slurm.conf on disk triggers a handler to reload slurmctld - this handler is pending until the end of the play (see the sketch below the list).
  2. The "Configure Slurm service" task then runs, which ensures slurm[d/ctld] is enabled and running:
     i. On the control node it's a no-op.
     ii. On the new compute node it tries to start slurmd, which fails with `fatal: Unable to determine this slurmd's NodeName`, as the node is only defined in the on-disk config, not the running one.
  3. The end of the play is reached and the handler runs to reload slurmctld. This also fails because the running and on-disk configs have different numbers of nodes. For some reason this doesn't show as a failure in Ansible.
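
For context, the task/handler ordering is roughly as follows. This is a simplified sketch, not the role's actual tasks: the task names, paths and the `openhpc_enable.control` variable here are illustrative assumptions.

```yaml
# Simplified sketch of the ordering - names/paths are illustrative, not the role's actual ones.

# tasks/main.yml
- name: Template slurm.conf
  template:
    src: slurm.conf.j2
    dest: /etc/slurm/slurm.conf
  notify: Reload slurmctld          # handler is queued, but only runs at end of play

- name: Configure Slurm service
  service:
    name: "{{ 'slurmctld' if openhpc_enable.control | default(false) else 'slurmd' }}"
    enabled: true
    state: started                  # on a new compute node slurmd starts here, before the reload

# handlers/main.yml
- name: Reload slurmctld
  service:
    name: slurmctld
    state: reloaded                 # runs at end of play, after the service task above
```

Because the handler only fires at the end of the play, the new slurmd is started against a slurmctld that doesn't yet know about it.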
sjpb commented 3 years ago

Tests added in fb42d70

sjpb commented 3 years ago

Note that adding a node to the cluster requires a restart of all slurmctld/slurmd daemons, which this role has (deliberately) never done, so that behaviour is not necessarily a problem. However, taking out the slurmctld isn't acceptable.
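
For illustration, a full restart after adding nodes would look something like the play below. This is a minimal sketch only; the host group names ("control", "compute") are assumptions, not anything this role defines.

```yaml
# Hypothetical play restarting the whole cluster after adding nodes.
# Group names are assumptions, not the role's defaults.
- hosts: control
  become: true
  tasks:
    - name: Restart slurmctld so the running config picks up the new nodes
      service:
        name: slurmctld
        state: restarted

- hosts: compute
  become: true
  tasks:
    - name: Restart (or start) slurmd now that the controller knows about all nodes
      service:
        name: slurmd
        state: restarted
```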

sjpb commented 3 years ago

Confirmed this is not configless-specific - the slurmctld failure occurs when handlers run at end-of-play with the default configless=false too.

sjpb commented 3 years ago

Re. the failure to reload the slurmctld (3 above) not showing up in Ansible, see my bug report for the OpenHPC systemd unit here. "Interestingly", reloading slurmctld with fewer nodes than are running only generates a warning rather than taking down the daemon.
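
Since a reload that drops nodes only warns, one way to surface the mismatch would be to compare the inventory against what the running slurmctld reports. A rough sketch, assuming a "compute" inventory group and that inventory hostnames match Slurm NodeNames (neither is guaranteed by this role):

```yaml
# Hypothetical check (not part of the role): does the running slurmctld know about
# every node in the inventory's "compute" group? Assumes inventory hostnames match
# Slurm NodeNames and that sinfo is available on the controller.
- name: List nodes known to the running slurmctld
  command: sinfo -N -h -o '%n'
  register: _sinfo_nodes
  changed_when: false

- name: Fail if any inventory compute node is missing from the running config
  assert:
    that: item in _sinfo_nodes.stdout_lines
  loop: "{{ groups['compute'] }}"
```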