Closed sjpb closed 1 year ago
@wtripp180901 some more thoughts:
- I'd suggest making the `MaxNodeCount` and node definition in `slurm.conf` be templated from some helm chart value for max nodes.
- There used to be a `PrivateData=cloud` option to hide "non-powered-up" nodes, but it appears that is now the default, so I'd be interested to know what `sinfo` looks like with e.g. 2 of 10 possible nodes launched.
- Reading this, I'm also not sure we need fanout comms disabling via `TreeWidth` any more?
- It'd be good to check (via `nslookup` or something, `dnf install -y bind-utils`) whether the pod DNS entries are changed immediately on pod deletion/recreation or whether the cached value has to expire.
- Maybe set `ReturnToService=2` in `slurm.conf`?
- `SlurmdTimeout` is specified twice in `slurm.conf`.
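To make the templating and `slurm.conf` points above concrete, a minimal sketch of what a templated fragment could look like. The `maxNodes` value name, the `slurmd-N` node naming, and the resource sizes are all assumptions for illustration, not existing chart conventions:

```
# values.yaml (hypothetical key):
#   maxNodes: 10

# slurm.conf template fragment (e.g. in the chart's ConfigMap):
MaxNodeCount={{ .Values.maxNodes }}
NodeName=slurmd-[0-{{ sub .Values.maxNodes 1 }}] State=FUTURE CPUs=2 RealMemory=2000
ReturnToService=2   # DOWN nodes become usable again when slurmd re-registers
SlurmdTimeout=300   # defined once only
```

The `sub` function is from helm's built-in Sprig library, so `maxNodes: 10` would render the hostlist as `slurmd-[0-9]`.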
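For the DNS check, something along these lines (run inside a pod; the pod name and the assumption of a dnf-based image are illustrative):

```shell
# inside a pod:
dnf install -y bind-utils

# resolve a slurmd pod by short name (relies on the search domain):
nslookup slurmd-0

# recreate the pod, then re-resolve and compare the returned IP
# to see whether the entry updates immediately or a cached value lingers:
kubectl delete pod slurmd-0   # from the kubectl host
nslookup slurmd-0             # from inside another pod
```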
Slurmd pods can be deleted and recreated without breaking node resolution. Changes are:

- … `sinfo` … (use shorter hostlist expressions, which is nice).
- Adding a `dnsConfig` in all pod templates with a search domain, such that pod short names resolve.
- Defining nodes in `slurm.conf`, and then passing `slurmd` the `-F` option rather than `-Z`.

NB: #5 is important; without that this doesn't seem to work.
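For reference, the FUTURE-node approach (defining nodes in `slurm.conf` and starting `slurmd` with `-F`) amounts to something like the sketch below; node names and sizes are illustrative:

```
# slurm.conf: declare the dynamic nodes up front, in FUTURE state:
MaxNodeCount=10
NodeName=slurmd-[0-9] State=FUTURE CPUs=2 RealMemory=2000

# each slurmd pod then activates a matching FUTURE definition:
slurmd -F
# ...rather than registering a fully dynamic, undeclared node with:
slurmd -Z --conf "CPUs=2 RealMemory=2000"
```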
Example shell output below shows cluster surviving slurmd pod deletion/recreation at 3c2ba62. Note:
- `bash-4.4$` prompt is the `rocky` user in the login pod (exec'd in, not SSH, due to cloud FIP problems).
- `/slurm-docker-cluster]$` prompt is on the `kubectl` host.
- `ewatch` is https://github.com/sjpb/ewatch, which provides a timestamped view of changes in the output of the command passed, at 2s (by default) intervals.

Note that point 4 above (using FUTURE node states) seems to be important. Without that (i.e. without defining nodes in `slurm.conf`, and using `slurmd -Z ...` instead) we get this:

@wtripp180901 can you review this please and test it properly? In particular, I've changed the slurmd timeout to 30s just for ease of testing, but I don't really know whether changing this from the default is required/important. When setting up this PR I did once launch a job before the new slurmds had finished "bouncing", and the job just went into waiting for the correct node state, which is IMO the correct behaviour, but this could do with more testing.