Requires KUBECONFIG to be defined in the shell used to run helm. The kubeconfig is injected into the cluster as a secret, which is then used to create slurmd pods on demand.
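One way this injection could work (a sketch only; the release name, chart path and value name here are illustrative, not taken from the chart):

```shell
# Hypothetical: pass the kubeconfig file contents to helm as a value,
# which the chart can render into a Secret mounted by the controller.
helm install slurm-cluster ./chart --set-file kubeconfig="${KUBECONFIG}"
```

helm's --set-file reads the file at the given path and passes its contents as the named value, so KUBECONFIG must point at a readable kubeconfig in the shell running helm.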
Uses host network for slurmd pods. These pods have a hostPort defined (for slurmd), which means two slurmd pods can't be scheduled onto the same k-node.
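A minimal sketch of the relevant pod spec fields (names and image are illustrative, not taken from the chart; 6818 is slurmd's default port):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slurmd-node-0        # hypothetical name
spec:
  hostNetwork: true          # pod shares the k-node's network namespace
  containers:
    - name: slurmd
      image: example/slurmd:latest   # placeholder image
      ports:
        - containerPort: 6818        # default slurmd port
          hostPort: 6818             # claims the port on the k-node, so at most
                                     # one such pod can schedule per k-node
```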
Note there's no ResumeFailProgram defined. Assuming the slurmd pod definition is OK, the most likely reason for a "resume failure" from Slurm's point of view is the pod staying pending beyond ResumeTimeout, e.g. because there aren't enough k-nodes. While the s-node does then show DOWN, if k-resources become available later the pod will still launch, at which point the s-node changes from DOWN to IDLE (this has been tested). Note this is different from e.g. autoscaling on OpenStack, where a VM launch that fails due to lack of cloud resources will not keep retrying, and the s-node state needs to be reset before Slurm will try to launch it again. So leaving the s-node showing as DOWN here actually accurately reflects the state of cloud resources.
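The power-save settings involved might look like the following slurm.conf excerpt (paths and values are illustrative, not the chart's actual configuration):

```ini
# Hypothetical slurm.conf excerpt for cloud/autoscale nodes.
ResumeProgram=/opt/slurm/resume.sh    # creates the slurmd pod for the s-node
ResumeTimeout=300                     # seconds before Slurm marks the s-node DOWN
SuspendProgram=/opt/slurm/suspend.sh  # deletes the pod when the s-node idles out
SuspendTime=600                       # idle seconds before suspending
# Deliberately no ResumeFailProgram: a pending pod still starts once
# k-resources free up, and the s-node then recovers from DOWN to IDLE.
```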