kubernetes-sigs / lws

LeaderWorkerSet: An API for deploying a group of pods as a unit of replication
Apache License 2.0

LeaderWorkerSet cannot spin up group on OpenShift #172

Closed mohittalele closed 2 months ago

mohittalele commented 3 months ago

What happened: On OpenShift, LWS was deployed using the instructions given in the repository. When I create an LWS CR, I expect leader and worker pods as described in the documentation. However, only the leader pod is spun up and no worker pods are provisioned by LWS. The controller logs the error below -

    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
2024-07-02T12:26:12Z    ERROR   Reconciler error    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"leaderworkerset-multi-template-0","namespace":"test"}, "namespace": "test", "name": "leaderworkerset-multi-template-0", "reconcileID": "09c19774-ee4e-44a6-b08e-051c0610b3b6", "error": "statefulsets.apps \"leaderworkerset-multi-template-0\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
2024-07-02T12:26:32Z    ERROR   Reconciler error    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"leaderworkerset-multi-template-0","namespace":"test"}, "namespace": "test", "name": "leaderworkerset-multi-template-0", "reconcileID": "edfc66be-21b8-4b59-8c85-d27bec28f25f", "error": "statefulsets.apps \"leaderworkerset-multi-template-0\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227
2024-07-02T12:27:13Z    ERROR   Reconciler error    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"leaderworkerset-multi-template-0","namespace":"test"}, "namespace": "test", "name": "leaderworkerset-multi-template-0", "reconcileID": "86567ffb-99d6-45be-b865-3a92123e2bb0", "error": "statefulsets.apps \"leaderworkerset-multi-template-0\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227

What you expected to happen: I expect the leader and worker groups to be in a Running state.

How to reproduce it (as minimally and precisely as possible):

Use the template below to create the LWS:


apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-multi-template
spec:
  replicas: 1
  leaderWorkerTemplate:
    leaderTemplate:
      spec:
        containers:
        - name: busybox
          image: busybox
          env:
          - name: HOME 
            value: /tmp
          command: 
          - sh
          - -c
          - |
            sleep 3600
          resources:
            limits:
              cpu: "100m"
            requests:
              cpu: "50m"
          ports:
          - containerPort: 8080
    size: 4
    workerTemplate:
      spec:
        containers:
        - name: nginx
          image: busybox
          command: 
          - sh
          - -c
          - |
            sleep 3600
          resources:
            limits:
              cpu: "100m"
            requests:
              cpu: "50m"
          ports:
          - containerPort: 8080

Anything else we need to know?:

Environment:

liurupeng commented 3 months ago

For each replica, LWS will first create a leader pod; after the leader pod is scheduled, it will create a worker StatefulSet and set an owner reference on it that points to the leader pod. This may cause issues when running on OpenShift; it seems OpenShift doesn't allow mutating that resource.
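Roughly, the worker StatefulSet ends up carrying an owner reference like the sketch below. The name and namespace are taken from the error log above; the uid is a placeholder and the exact fields the controller sets may differ slightly.

# Sketch of the owner reference on the worker StatefulSet (illustrative, not exact controller output)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: leaderworkerset-multi-template-0
  namespace: test
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: leaderworkerset-multi-template-0  # the leader pod of this group
    uid: <leader-pod-uid>                   # placeholder
    controller: true
    blockOwnerDeletion: true                # the field the API server rejects in the error above

Setting blockOwnerDeletion on an ownerReference that points to a Pod is exactly what the "forbidden" error in the controller logs is complaining about.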

mohittalele commented 3 months ago

@liurupeng thanks for your explanation. Yes, I suspect OpenShift is unable to mutate the required resources. Can you elaborate on which permissions and which specific role need to be changed in order for it to work?

mohittalele commented 2 months ago

@liurupeng Any pointers would be really appreciated :)

liurupeng commented 2 months ago

Hi @mohittalele, these are the required RBAC roles: https://github.com/kubernetes-sigs/lws/blob/main/config/rbac/role.yaml. @Edwinhr716 could you reproduce this issue and see whether, with the proper permissions, we can run the controller on OpenShift?
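The error message matches what the OwnerReferencesPermissionEnforcement admission plugin produces, and OpenShift enables that plugin by default: to set blockOwnerDeletion on an ownerReference that points to a Pod, the requester needs update permission on the pods/finalizers subresource. If that rule is missing from the installed ClusterRole, the extra rule would look roughly like the sketch below (the role name here is illustrative; compare against config/rbac/role.yaml):

# Sketch only: extra RBAC rule that may be needed on OpenShift
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: lws-manager-role  # illustrative name, not necessarily the installed role name
rules:
- apiGroups:
  - ""
  resources:
  - pods/finalizers
  verbs:
  - update

For this to take effect, the rule would also have to be bound to the service account the lws controller runs as.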

mohittalele commented 2 months ago

@liurupeng thanks for your reply. I get the same errors even after updating the RBAC roles as given in that file. This is the test example I used:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-multi-template
  namespace: dl-llmaas-test
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      metadata: {}
      spec:
        containers:
        - command:
          - sh
          - -c
          - |
            sleep 3600
          env:
          - name: HOME
            value: /tmp
          image: busybox
          name: busybox
          ports:
          - containerPort: 8080
            protocol: TCP
          resources:
            limits:
              cpu: 100m
            requests:
              cpu: 50m
    restartPolicy: Default
    size: 4
    workerTemplate:
      metadata: {}
      spec:
        containers:
        - command:
          - sh
          - -c
          - |
            sleep 3600
          image: busybox
          name: nginx
          ports:
          - containerPort: 8080
            protocol: TCP
          resources:
            limits:
              cpu: 100m
            requests:
              cpu: 50m
  replicas: 2
  rolloutStrategy:
    rollingUpdateConfiguration:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  startupPolicy: LeaderCreated

kannon92 commented 2 months ago

/assign

kannon92 commented 2 months ago

Can confirm that I see the same behavior on OpenShift 4.15 as the user.