vultr / slik

Slurm in Kubernetes
https://vultr.com
Apache License 2.0

[BUG] - Operator gets stuck deploying a slinkee cluster when nodes have taints (e.g., control plane nodes) #11

Open t1mk1k opened 5 months ago

t1mk1k commented 5 months ago

Describe the bug

The slinkee operator gets stuck deploying a slinkee cluster when there are tainted nodes that will never have a slurmabler deployed on them. The operator waits until every node has been labelled, but the slurmabler will not be scheduled on nodes with a taint, so those nodes never receive the label.

In many Kubernetes clusters the control plane nodes are tainted so that regular workloads cannot be scheduled on them, so this will be an issue on many clusters.
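For illustration, a check along these lines could decide whether a node can ever receive a slurmabler pod. This is a minimal Go sketch assuming client-go and the core/v1 types; the helper name is illustrative and not taken from the slinkee codebase:

    package slurm

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // slurmablerCanSchedule reports whether a pod with the given tolerations
    // (e.g. the slurmabler DaemonSet pod) could be scheduled on the node,
    // i.e. every NoSchedule/NoExecute taint on the node is tolerated.
    // Illustrative helper, not part of the slinkee code.
    func slurmablerCanSchedule(node *corev1.Node, tolerations []corev1.Toleration) bool {
        for i := range node.Spec.Taints {
            taint := &node.Spec.Taints[i]
            if taint.Effect != corev1.TaintEffectNoSchedule &&
                taint.Effect != corev1.TaintEffectNoExecute {
                continue // PreferNoSchedule does not block scheduling
            }
            tolerated := false
            for j := range tolerations {
                if tolerations[j].ToleratesTaint(taint) {
                    tolerated = true
                    break
                }
            }
            if !tolerated {
                return false
            }
        }
        return true
    }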

To Reproduce

Steps to reproduce the behavior using a Kind cluster:

  1. Create a kind cluster config for one control plane node and one worker node in /tmp/one-node-kind.yml:
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
    - role: worker
  2. Create a kind cluster with the config:
    kind create cluster --config /tmp/one-node-kind.yml
  3. Deploy slinkee via helm
    helm install -f helm/slinkee/values.yaml slinkee ./helm/slinkee/
  4. Deploy the simple slinkee cluster
    kubectl apply -f payloads/simple.yaml 
  5. Wait for the slinkee-operator to create the slurmablers and observe that the worker node gets labels while the control plane node does not:
    kubectl get nodes --show-labels
  6. In the logs of the slinkee-operator you can see it is waiting for nodes to be labelled:
    kubectl logs slinkee-operator-767fb59df6-7w66j --tail 10
    2024-06-21T11:34:52.987Z    INFO    slurm/create_slurmabler.go:102  github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet node lacking labels...  {"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
    2024-06-21T11:34:53.993Z    INFO    slurm/create_slurmabler.go:102  github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet node lacking labels...  {"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
    2024-06-21T11:34:54.999Z    INFO    slurm/create_slurmabler.go:102  github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet node lacking labels...  {"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
    2024-06-21T11:34:56.006Z    INFO    slurm/create_slurmabler.go:102  github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet node lacking labels...  {"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
    2024-06-21T11:34:57.012Z    INFO    slurm/create_slurmabler.go:102  github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet node lacking labels...  {"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
    2024-06-21T11:34:58.020Z    INFO    slurm/create_slurmabler.go:102  github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet node lacking labels...  {"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
    2024-06-21T11:34:59.027Z    INFO    slurm/create_slurmabler.go:102  github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet node lacking labels...  {"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
    2024-06-21T11:35:00.033Z    INFO    slurm/create_slurmabler.go:102  github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet node lacking labels...  {"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
    2024-06-21T11:35:01.039Z    INFO    slurm/create_slurmabler.go:102  github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet node lacking labels...  {"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
    2024-06-21T11:35:02.046Z    INFO    slurm/create_slurmabler.go:102  github.com/vultr/slinkee/pkg/slurm.buildSlurmablerDaemonSet node lacking labels...  {"host": "10.99.172.138", "hostname": "slinkee-operator-767fb59df6-7w66j", "pid": 1}
  7. Remove the taint from the control plane node:
    kubectl taint node kind-control-plane node-role.kubernetes.io/control-plane:NoSchedule-
  8. Now watch the simple slinkee cluster being deployed:
    kubectl get pods -n default -w

Expected behavior

The slinkee operator should ignore nodes that cannot have a slurmabler scheduled on them because of taints.
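As a rough sketch of that behaviour, the label wait could skip nodes the slurmabler does not tolerate, reusing the check sketched above. This assumes client-go and apimachinery; the label key and function name are placeholders, not what slinkee actually uses:

    package slurm

    import (
        "context"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForSlurmablerLabels waits only for nodes the slurmabler can actually
    // land on; nodes with taints it does not tolerate are skipped instead of
    // blocking the reconcile loop forever.
    func waitForSlurmablerLabels(ctx context.Context, client kubernetes.Interface, tolerations []corev1.Toleration) error {
        return wait.PollUntilContextCancel(ctx, time.Second, true, func(ctx context.Context) (bool, error) {
            nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
            if err != nil {
                return false, err
            }
            for i := range nodes.Items {
                node := &nodes.Items[i]
                if !slurmablerCanSchedule(node, tolerations) {
                    continue // no slurmabler will ever run here, so no label will appear
                }
                // placeholder label key, not the key slinkee sets
                if _, ok := node.Labels["example.com/slurmabler-ready"]; !ok {
                    return false, nil // keep waiting for this node
                }
            }
            return true, nil
        })
    }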

Additional context

Commit of repo used for testing: 55388067c0a7469742bec558c9c8dc52fcbc9570
Kind version: kind v0.22.0 go1.21.7 darwin/arm64

vultj commented 5 months ago

@t1mk1k can you try the newest build and let me know if that resolves it?

t1mk1k commented 5 months ago

@vultj I've changed the image tags in the Helm chart to v0.0.2 and now a cluster is deployed successfully.

However, one of the deployments is still trying to schedule a pod on a control plane node:

✗ kubectl get pods -owide                                     
NAME                                       READY   STATUS    RESTARTS   AGE    IP           NODE          NOMINATED NODE   READINESS GATES
slik-operator-6bf7848d88-wqxqh             1/1     Running   0          102m   10.244.1.4   kind-worker   <none>           <none>
test-kind-control-plane-78c96648b7-jms5g   0/2     Pending   0          99m    <none>       <none>        <none>           <none>
test-kind-worker-cb684697f-9bfhr           2/2     Running   0          99m    10.244.1.7   kind-worker   <none>           <none>
test-slurm-toolbox-64bf746f8c-7s2pl        2/2     Running   0          99m    10.244.1.8   kind-worker   <none>           <none>
test-slurmabler-kz6xb                      1/1     Running   0          99m    10.244.1.5   kind-worker   <none>           <none>
test-slurmctld-dbfcd569f-r7cfd             2/2     Running   0          99m    10.244.1.6   kind-worker   <none>           <none>
✗ kubectl get deployments.apps test-kind-control-plane -owide
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS   IMAGES                                SELECTOR
test-kind-control-plane   0/1     1            0           102m   slurmd       ewr.vultrcr.com/slurm/slurmd:v0.0.2   app=test-slurmd,host=kind-control-plane

It does not appear to affect the Slurm cluster, but it would probably be good to filter out tainted nodes at the deployment stage too.
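Something along these lines could work at the deployment stage, reusing the taint check sketched earlier, so a per-node slurmd Deployment is never created for a node it cannot run on (names are illustrative, not from the slinkee code):

    // schedulableNodes keeps only the nodes the slurmabler could be scheduled on,
    // so per-node slurmd Deployments are never created for tainted nodes.
    func schedulableNodes(nodes []corev1.Node, tolerations []corev1.Toleration) []corev1.Node {
        out := make([]corev1.Node, 0, len(nodes))
        for i := range nodes {
            if slurmablerCanSchedule(&nodes[i], tolerations) {
                out = append(out, nodes[i])
            }
        }
        return out
    }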

odellem commented 3 months ago

These issues are worse when your cluster has specific requirements for what is allowed to be deployed on it. For example, on OpenShift, infrastructure taints have to be tolerated to stay in license compliance. So ideally, you would want a way to tell the operator not to deploy on certain nodes.
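For example, the operator could accept an opt-in node filter. A minimal Go sketch, assuming a hypothetical spec.nodeSelector field on the Slinkee custom resource (the field name and semantics are invented for illustration):

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/labels"
    )

    // selectNodes filters nodes by a label selector taken from a hypothetical
    // spec.nodeSelector field on the Slinkee custom resource. An empty selector
    // matches every node, preserving today's behaviour.
    func selectNodes(nodes []corev1.Node, nodeSelector map[string]string) []corev1.Node {
        sel := labels.SelectorFromSet(labels.Set(nodeSelector))
        out := make([]corev1.Node, 0, len(nodes))
        for i := range nodes {
            if sel.Matches(labels.Set(nodes[i].Labels)) {
                out = append(out, nodes[i])
            }
        }
        return out
    }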

vultj commented 3 months ago

Likely, but I welcome any PRs that add such functionality.