stackhpc / ansible-role-openhpc

Ansible role for OpenHPC
Apache License 2.0

Support less-noisy reconfigure #90

Open sjpb opened 3 years ago

sjpb commented 3 years ago

With an image-based deployment, the current workflow for adding a node looks like this:

  1. Boot a new compute node. It attempts to join the cluster, slurmctld reports that it has no NodeName entry for the node, and slurmd dies.
  2. Run the role on the ENTIRE cluster, so that:
    • a new slurm.conf is generated that includes the new node
    • slurmctld and ALL slurmd daemons are restarted (including the new, failed one) in the correct order

Step 2 is really noisy, as every compute node runs all of the Ansible tasks. It would be good if we could run just the appropriate steps for these cases.

I think the cases covered are:

We could probably do something using just the configure tag, but this needs testing and documenting.
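
As an untested sketch of what the tag-only approach might look like: the helper below just builds an `ansible-playbook` command line restricted to the configure tag. The function name `reconfigure_cmd`, the inventory path and the playbook name are placeholders I made up for illustration, not anything from the role. Appending `--list-tasks` only previews which tasks the tag selects, so nothing is changed on the cluster.

```shell
#!/bin/sh
# Hypothetical helper: prints an ansible-playbook command limited to tasks
# tagged "configure". The inventory and playbook arguments are placeholders.
reconfigure_cmd() {
    inventory="$1"
    playbook="$2"
    printf 'ansible-playbook -i %s %s --tags configure\n' "$inventory" "$playbook"
}

# Preview only: --list-tasks lists the matching tasks without executing them.
echo "$(reconfigure_cmd inventory/hosts site.yml) --list-tasks"
```

Whether the tasks selected this way are enough to regenerate slurm.conf and restart the daemons in the right order is exactly the part that still needs testing.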