stackhpc / slurm-k8s-cluster

A Slurm cluster for Kubernetes
MIT License

Fix slurmd pod recreation via k8s dns and cloud nodes #13

Closed: sjpb closed this 1 year ago

sjpb commented 1 year ago

With this change, slurmd pods can be deleted and recreated without breaking node resolution. The changes are:

  1. Change slurmd to be a StatefulSet, so that the pods have predictable names (this also makes e.g. sinfo use shorter hostlist expressions, which is nice).
  2. Create a headless service for the slurmd pods and set the StatefulSet's serviceName to it (see the docs for headless services, StatefulSets and DNS, noting the differences for headless services in the last).
  3. Populate dnsConfig in all pod templates with a search domain, so that pod short names resolve (see the sketch below).
  4. Change the Slurm nodes to be cloud nodes in FUTURE state rather than "plain" dynamic nodes: this requires defining the nodes in slurm.conf and passing slurmd the -F option rather than -Z.
  5. Configure Slurm to use "cloud" DNS with no address caching (a slurm.conf sketch is given after the failing example near the end).

NB: item 5 is important; without it this doesn't seem to work.
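For illustration, here is a minimal sketch of what items 1-3 amount to in the chart's manifests. The resource and label names are illustrative and the namespace is assumed to be default (the search domain would differ in another namespace); item 3's dnsConfig stanza goes in every pod template, not just the slurmd one, so the actual chart templates may differ:

apiVersion: v1
kind: Service
metadata:
  name: slurmd
spec:
  clusterIP: None                      # headless: gives each slurmd pod a stable DNS record
  selector:
    app: slurmd
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: slurmd
spec:
  serviceName: slurmd                  # ties the pods' DNS records to the headless service above
  replicas: 2
  selector:
    matchLabels:
      app: slurmd
  template:
    metadata:
      labels:
        app: slurmd
    spec:
      dnsConfig:
        searches:
          - slurmd.default.svc.cluster.local   # so the bare pod name (slurmd-0, slurmd-1) resolves
      containers:
        - name: slurmd
          # image, ports, volumes etc. as in the existing slurmd spec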

The example shell output below shows the cluster surviving slurmd pod deletion/recreation at 3c2ba62:
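The timestamped sinfo blocks in the output were captured by periodically re-running sinfo; the exact command used isn't shown here, but a loop along these lines (purely illustrative, not part of the PR) reproduces the format:

# re-run sinfo every couple of seconds, printing a microsecond timestamp before each run
while true; do echo "[$(date +%FT%T.%6N)]"; sinfo; echo; sleep 2; done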

# Show nodes are up:
bash-4.4$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle slurmd-[0-1]
# Note node addresses are names, not IPs:
bash-4.4$ scontrol show nodes | grep NodeAddr
   NodeAddr=slurmd-0 NodeHostName=slurmd-0 Version=23.02.3
   NodeAddr=slurmd-1 NodeHostName=slurmd-1 Version=23.02.3  

# Show a job runs ok, and pre-deletion IPs:
bash-4.4$ srun -N2 bash -c 'ip a | grep "inet 172"'
    inet 172.17.56.106/32 scope global eth0
    inet 172.20.253.28/32 scope global eth0

# Above job completed cleanly:
bash-4.4$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
# Delete slurmd pods:
[rocky@steveb-docker slurm-docker-cluster]$ k delete statefulset slurmd

# Effect on slurm:
[2023-07-18T19:08:19.160318]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2  idle* slurmd-[0-1]

[2023-07-18T19:08:33.234313]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2  down* slurmd-[0-1]

# Recreate nodes:
[rocky@steveb-docker slurm-docker-cluster]$ helm upgrade sbtest slurm-cluster-chart/

# Node state "bounces" a bit in Slurm but settles quickly:
[2023-07-18T19:09:05.419305]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1  down* slurmd-1
all*         up   infinite      1   idle slurmd-0

[2023-07-18T19:09:07.435324]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle slurmd-[0-1]

[2023-07-18T19:09:11.456700]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1  idle* slurmd-0
all*         up   infinite      1   idle slurmd-1

[2023-07-18T19:09:13.466858]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle slurmd-[0-1]

[2023-07-18T19:09:17.488950]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2  idle* slurmd-[0-1]

[2023-07-18T19:09:19.500050]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle slurmd-[0-1]

# Show IP addresses have changed and a job runs OK:
bash-4.4$ srun -N2 bash -c 'ip a | grep "inet 172"'
    inet 172.17.56.111/32 scope global eth0
    inet 172.20.253.10/32 scope global eth0

# Show job completed cleanly
bash-4.4$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Note that point 4 above (using FUTURE node states) seems to be important. Without it (i.e. leaving the nodes undefined in slurm.conf and starting slurmd with -Z ...) we get the following; a sketch of the working configuration follows the output:

bash-4.4$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle slurmd-[0-1]

# Note node addresses now shown as IPs - and these are correct pod IPs:
bash-4.4$ scontrol show nodes | grep NodeAddr
   NodeAddr=172.17.56.112 NodeHostName=slurmd-1 Version=23.02.3
   NodeAddr=172.20.253.15 NodeHostName=slurmd-0 Version=23.02.3

# ... but job launch fails:
bash-4.4$ srun -N2 bash -c 'ip a | grep "inet 172"'
srun: error: fwd_tree_thread: can't find address for host slurmd-0, check slurm.conf
srun: error: fwd_tree_thread: can't find address for host slurmd-1, check slurm.conf
srun: error: Task launch for StepId=2.0 failed on node slurmd-1: Can't find an address, check slurm.conf
srun: error: Task launch for StepId=2.0 failed on node slurmd-0: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted
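For reference, a rough sketch of what points 4 and 5 look like in configuration terms; the exact node definition and parameter lines in the chart's slurm.conf may differ:

# slurm.conf (fragment): declare the nodes up front in FUTURE state (point 4)
NodeName=slurmd-[0-1] State=FUTURE

# slurm.conf (fragment): resolve node addresses via DNS, with no address caching (point 5)
SlurmctldParameters=cloud_dns
CommunicationParameters=NoAddrCache

# slurmd then registers against the FUTURE node definition rather than as a fully
# dynamic node, i.e.
slurmd -F
# instead of
slurmd -Z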

@wtripp180901 can you review this please and test it properly? In particular, I've changed the slurmd timeout to 30s just for ease of testing, but I don't really know whether changing this from the default is required or important. When setting up this PR I did launch a job before the new slurmds had finished "bouncing", and the job just waited until the nodes reached the correct state, which is IMO the correct behaviour, but this could do with more testing.

sjpb commented 1 year ago

@wtripp180901 some more thoughts:

sjpb commented 1 year ago

SlurmdTimeout is specified twice in slurm.conf