kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
420 stars 211 forks source link

Need to add "coordination.k8s.io" and "-leases" to clusterrole #508

Closed VincentDu2021 closed 1 year ago

VincentDu2021 commented 1 year ago

Seeing this error after applying the mpi-operator/deploy/v2beta1/mpi-operator.yaml:

E0126 19:02:25.423446       1 leaderelection.go:330] error retrieving resource lock mpi-operator/mpi-operator: leases.coordination.k8s.io "mpi-operator" is forbidden: User "system:serviceaccount:mpi-operator:mpi-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "mpi-operator"

After adding the 'coordination.k8s.io' and 'leases' into the clusterrole, the issue is gone. I can submit a PR if you guys prefer.

423 - apiGroups:
424   - scheduling.incubator.k8s.io
425   - scheduling.sigs.dev
426   - coordination.k8s.io
427   resources:
428   - queues
429   - podgroups
430   - leases
431   verbs:
432   - '*'
tenzen-y commented 1 year ago

@VincentDu2021 Thanks for creating this! Maybe, I forgot to re-generate deploy/v2beta1/mpi-operator.yaml. Can you try to run kubectl apply -k manifests/overlays/standalone/?

VincentDu2021 commented 1 year ago

@VincentDu2021 Thanks for creating this! Maybe, I forgot to re-generate deploy/v2beta1/mpi-operator.yaml. Can you try to run kubectl apply -k manifests/overlays/standalone/?

Yeah, I see in the /manifests/base/mpi-operator.yaml you got them included. NIP: do you need "*" for the verbs? In my test I only included "get, list, update" and it seems working just fine:

112 - apiGroups:
113   - coordination.k8s.io
114   resources:
115   - leases
116   verbs:
117   - get
118   - list
119   - update
tenzen-y commented 1 year ago

@VincentDu2021 Thanks for creating this! Maybe, I forgot to re-generate deploy/v2beta1/mpi-operator.yaml. Can you try to run kubectl apply -k manifests/overlays/standalone/?

Yeah, I see in the /manifests/base/mpi-operator.yaml you got them included. NIP: do you need "*" for the verbs? In my test I only included "get, list, update" and it seems working just fine:

112 - apiGroups:
113   - coordination.k8s.io
114   resources:
115   - leases
116   verbs:
117   - get
118   - list
119   - update

We need "*" for the verbs in mpioperator/mpi-operator:master image. I guess you might be using the mpioperator/mpi-operator:latest image. Our latest tag is a bit old. Can you check the mpi-operator image with kubectl get deployments mpi-operator -ojsonpath="{.spec.template.spec.containers[0].image}"?

VincentDu2021 commented 1 year ago

kubectl get deployments mpi-operator -ojsonpath="{.spec.template.spec.containers[0].image}" Yes, I am using the "latest" tag : "mpioperator/mpi-operator:latest"

tenzen-y commented 1 year ago

kubectl get deployments mpi-operator -ojsonpath="{.spec.template.spec.containers[0].image}" Yes, I am using the "latest" tag : "mpioperator/mpi-operator:latest"

Can you use mpioperator/mpi-operator:master image? Maybe, mpi-operator outputs errors in logs if you use your suggestion ClusterRole.

VincentDu2021 commented 1 year ago

kubectl get deployments mpi-operator -ojsonpath="{.spec.template.spec.containers[0].image}" Yes, I am using the "latest" tag : "mpioperator/mpi-operator:latest"

Can you use mpioperator/mpi-operator:master image? Maybe, mpi-operator outputs errors in logs if you use your suggestion ClusterRole.

You are correct, @tenzen-y . Using image with "master" tag, if I only included "get, list, update", I'd get error:

E0127 20:43:54.084766 1 leaderelection.go:334] error initially creating leader election record: leases.coordination.k8s.io is forbidden: User "system:serviceaccount:smoke-test:mpi-operator" cannot create resource "leases" in API group "coordination.k8s.io" in the namespace "smoke-test"

Notice I changed namespace, after adding "create" verb into the list, this error is gone, but since we need create it makes sense to add "delete" etc., which basically equals using "*".

I think this issue can be resolved. Thanks for your assistance.

tenzen-y commented 1 year ago

I think this issue can be resolved. Thanks for your assistance.

I would close this issue once we merge #508 to the master branch.

Thanks for your report.

tenzen-y commented 1 year ago

/assign /kind bug