stackhpc / slurm-k8s-cluster

A Slurm cluster for Kubernetes
MIT License

Add Slurm autoscaling #18

Open sjpb opened 1 year ago

sjpb commented 1 year ago

Requires KUBECONFIG to be defined in the shell used to run helm. The kubeconfig is injected as a secret, which is then used to create pods on demand.
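As a rough illustration of the secret injection described above, here is a minimal sketch that reads the file named by KUBECONFIG and wraps it in a Secret manifest. The secret name `slurm-kubeconfig`, the data key `config`, and both function names are hypothetical, not taken from the chart, and the sketch assumes KUBECONFIG holds a single path rather than a colon-separated list.

```python
import base64
import os


def load_kubeconfig_from_env(environ=os.environ) -> str:
    """Read the kubeconfig named by KUBECONFIG (assumed to be a single path)."""
    path = environ.get("KUBECONFIG")
    if not path:
        raise RuntimeError("KUBECONFIG must be set in the shell that runs helm")
    with open(path, encoding="utf-8") as fh:
        return fh.read()


def kubeconfig_secret(kubeconfig_text: str, name: str = "slurm-kubeconfig") -> dict:
    """Build a Secret manifest (as a dict) holding the kubeconfig.

    The name and the 'config' key are illustrative assumptions.
    """
    # Secret 'data' values must be base64-encoded strings.
    encoded = base64.b64encode(kubeconfig_text.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {"config": encoded},
    }
```

Failing early when KUBECONFIG is unset mirrors the requirement stated above; how the chart itself handles a missing value is not shown here.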

Uses host network for slurmd pods. These pods have a hostPort defined (for slurmd), which means no two of them get scheduled onto the same k-node.
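To make the hostPort effect concrete, here is a minimal sketch of such a pod spec plus a simplified check for host port clashes. The port 6818 is Slurm's default SlurmdPort; whether the chart uses that value, the image name, and the helper names are all assumptions. The real scheduler also considers host IP and protocol, which this sketch ignores.

```python
SLURMD_PORT = 6818  # Slurm's default SlurmdPort; the chart's value is an assumption


def slurmd_pod(node_name: str) -> dict:
    """Build a hypothetical slurmd pod manifest using the host network."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": node_name},
        "spec": {
            "hostNetwork": True,
            "containers": [
                {
                    "name": "slurmd",
                    "image": "example/slurmd:latest",  # placeholder image
                    "ports": [
                        {"containerPort": SLURMD_PORT, "hostPort": SLURMD_PORT}
                    ],
                }
            ],
        },
    }


def host_ports(pod: dict) -> set:
    """Collect every hostPort requested by the pod's containers."""
    return {
        port["hostPort"]
        for container in pod["spec"]["containers"]
        for port in container.get("ports", [])
        if "hostPort" in port
    }


def can_share_node(pod_a: dict, pod_b: dict) -> bool:
    """Simplified scheduling rule: pods fit together only if host ports differ."""
    return host_ports(pod_a).isdisjoint(host_ports(pod_b))
```

Under this model two slurmd pods always request the same host port, so `can_share_node` is false for any pair, which is why each one lands on its own k-node.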

Note there's no ResumeFailProgram defined. Assuming the slurmd pod definition is OK, the most likely reason for a "resume failure" from Slurm's PoV is the pod staying pending beyond ResumeTimeout, e.g. because there are not enough k-nodes. The s-node then shows DOWN, but if k-resources become available later the pod will launch, at which point the s-node changes from DOWN to IDLE (this has been tested). Note this differs from e.g. autoscaling on OpenStack, where a VM launch that fails for lack of cloud resources is not retried, and the s-node state needs to be reset before Slurm will try to launch it again. So leaving the s-node showing as DOWN accurately reflects the state of cloud resources here.
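The behaviour described above can be modelled as a small function of time. This is only a sketch of the observed transitions, not Slurm's internal logic: the function name is hypothetical, times are seconds since the resume request, and `POWERING_UP` is used as an illustrative label for a node still within ResumeTimeout.

```python
from typing import Optional


def s_node_state(pod_running_at: Optional[float], resume_timeout: float, now: float) -> str:
    """Model the s-node state for a resume requested at t=0.

    pod_running_at: time the slurmd pod started running, or None if it is
    still pending (e.g. not enough k-nodes).
    """
    # Once the pod is up and slurmd registers, the s-node is usable,
    # even if it was previously marked DOWN.
    if pod_running_at is not None and pod_running_at <= now:
        return "IDLE"
    # Pod still pending after ResumeTimeout: Slurm treats the resume as failed.
    if now > resume_timeout:
        return "DOWN"
    # Still within ResumeTimeout and waiting for the pod.
    return "POWERING_UP"
```

For example, with a ResumeTimeout of 300 and a pod that only starts at t=500, the model reports DOWN at t=400 and IDLE at t=600, matching the DOWN to IDLE recovery noted above.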

sjpb commented 1 year ago

NB: this CANNOT make use of #27