t1mk1k opened 5 months ago
@t1mk1k can you try newest build and let me know if that resolves?
@vultj I've changed the image tags in the Helm chart to v0.0.2 and now a cluster deploys successfully.
However, one of the deployments is still trying to schedule a pod on a control plane node:
```
✗ kubectl get pods -owide
NAME                                       READY   STATUS    RESTARTS   AGE    IP           NODE          NOMINATED NODE   READINESS GATES
slik-operator-6bf7848d88-wqxqh             1/1     Running   0          102m   10.244.1.4   kind-worker   <none>           <none>
test-kind-control-plane-78c96648b7-jms5g   0/2     Pending   0          99m    <none>       <none>        <none>           <none>
test-kind-worker-cb684697f-9bfhr           2/2     Running   0          99m    10.244.1.7   kind-worker   <none>           <none>
test-slurm-toolbox-64bf746f8c-7s2pl        2/2     Running   0          99m    10.244.1.8   kind-worker   <none>           <none>
test-slurmabler-kz6xb                      1/1     Running   0          99m    10.244.1.5   kind-worker   <none>           <none>
test-slurmctld-dbfcd569f-r7cfd             2/2     Running   0          99m    10.244.1.6   kind-worker   <none>           <none>
```

```
✗ kubectl get deployments.apps test-kind-control-plane -owide
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS   IMAGES                                SELECTOR
test-kind-control-plane   0/1     1            0           102m   slurmd       ewr.vultrcr.com/slurm/slurmd:v0.0.2   app=test-slurmd,host=kind-control-plane
```
It does not appear to affect the SLURM cluster, but it would probably be good to filter out tainted nodes at the deployment stage too.
These issues get worse when your cluster restricts what is allowed to deploy where. On OpenShift, for example, infrastructure taints have to be tolerated to stay in license compliance. So ideally, there would be a way to tell the operator not to deploy on certain nodes.
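If the operator exposed a way to pass tolerations through to the workloads it creates, satisfying a requirement like the OpenShift one could look roughly like the following config fragment (the taint key here is illustrative, not something the operator currently supports):

```yaml
# Hypothetical toleration the operator could attach to its pods so they
# are allowed onto infra nodes; the key is an assumption for illustration.
tolerations:
- key: node-role.kubernetes.io/infra
  operator: Exists
  effect: NoSchedule
```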
Likely, but I welcome any PRs that add such functionality.
Describe the bug
The slinkee operator gets stuck deploying a slinkee cluster when there are tainted nodes on which no slurmabler will be deployed. The operator waits until every node has been labelled, but the slurmabler will never be scheduled on nodes with a taint.
In many Kubernetes clusters the control-plane nodes are tainted so that regular workloads cannot be scheduled on them, so this will be a common issue.
To Reproduce
Steps to reproduce the behavior using a Kind cluster:
/tmp/one-node-kind.yml
Expected behavior
The slinkee operator should ignore nodes on which no slurmabler can be scheduled because of taints.
Additional context
Commit of repo used for testing: 55388067c0a7469742bec558c9c8dc52fcbc9570
Kind version: kind v0.22.0 go1.21.7 darwin/arm64