medik8s / fence-agents-remediation

Kubernetes Operator for providing high availability between nodes by automatically remediating them using well-known fence-agents.
https://www.medik8s.io/
Apache License 2.0
9 stars 8 forks source link

Add missing RBAC permissions #101

Closed jcanocan closed 1 year ago

jcanocan commented 1 year ago

When FAR is used with fence_azure_arm fence agent. It requires permissions for watching and listing the namespace objects for repairing unhealthy nodes. If the operator does not hold these permissions, the following error is shown and NHC stays in Remediating status :

2023-10-25T09:36:47.33411147Z   INFO    controllers.FenceAgentsRemediation  Manual workload deletion    {"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-9fnfq-worker-germanywestcentral1-m5nwt"}
W1025 09:36:47.336315       1 reflector.go:533] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:openshift-operators:fence-agents-remediation-controller-manager" cannot list resource "namespaces" in API group "" at the cluster scope
E1025 09:36:47.336350       1 reflector.go:148] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233: Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:openshift-operators:fence-agents-remediation-controller-manager" cannot list resource "namespaces" in API group "" at the cluster scope

It adds verbs list and watch verbs for namespace resources.

openshift-ci[bot] commented 1 year ago

Hi @jcanocan. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
razo7 commented 1 year ago

Sounds reasonable that these RBAC permissions should be added (as SNR needed them as well), but I believe the get variable should be added as well. See old discussion for a similar idea on the notion of using get;list;watch rather than just list;watch.

Thank you for catching that and for participating.

  1. Can you use get as well or try to verify the error for different combinations?
  2. Please also run make bundle in order to re-generate the CSV with those changes.
jcanocan commented 1 year ago

Sounds reasonable that these RBAC permissions should be added (as SNR needed them as well), but I believe the get variable should be added as well. See old discussion for a similar idea on the notion of using get;list;watch rather than just list;watch.

Thank you for catching that and for participating.

Glad to contribute with the project. It is very interesting and promising :)

1. Can you use `get` as well or try to verify the error for [different combinations](https://github.com/medik8s/fence-agents-remediation/pull/48#discussion_r1191054622)?

Sure! Here are the results:

  1. With just get:
    2023-10-25T15:01:18.293589335Z  INFO    controllers.FenceAgentsRemediation  Manual workload deletion    {"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-9fnfq-worker-germanywestcentral3-nmbx8"}
    W1025 15:01:18.295546       1 reflector.go:535] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:openshift-operators:fence-agents-remediation-controller-manager" cannot list resource "namespaces" in API group "" at the cluster scope
    E1025 15:01:18.295576       1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:openshift-operators:fence-agents-remediation-controller-manager" cannot list resource "namespaces" in API group "" at the cluster scope
  2. get and watch:
    2023-10-25T15:16:05.300798484Z  INFO    controllers.FenceAgentsRemediation  Manual workload deletion    {"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-9fnfq-worker-germanywestcentral3-nmbx8"}
    W1025 15:16:05.302447       1 reflector.go:535] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:openshift-operators:fence-agents-remediation-controller-manager" cannot list resource "namespaces" in API group "" at the cluster scope
    E1025 15:16:05.302484       1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:openshift-operators:fence-agents-remediation-controller-manager" cannot list resource "namespaces" in API group "" at the cluster scope

    It's a bit strange that it complains about watch besides the fact that it has been added...

  3. get and list:
    E1025 15:21:57.247278       1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: Failed to watch *v1.Namespace: unknown (get namespaces)

    However, just using get and list FAR is able to reconcile and finish the job.

  4. list gets same result as 3).

In conclusion, it looks like the only verb really required is list. Nevertheless, it throws an error but end the execution successfully. So, should we also add the watch verb then? WDYT?

2. Please also run `make bundle` in order to re-generate the CSV with those changes.

I will when we have a decision about if we want to add just get and list or get, list and watch. Thanks for letting me know.

razo7 commented 1 year ago

Sure! Here are the results:

That was fast :)

In conclusion, it looks like the only verb really required is list. Nevertheless, it throws an error but end the execution successfully.

For option 0 (list;watch), option 3 (get;list), and option 4(list) the reconciliation was executed successfully but with a warning. How is it for option 5 (get;list;watch)? I believe it would finish without a warning

jcanocan commented 1 year ago

That was fast :)

Thanks :)

In conclusion, it looks like the only verb really required is list. Nevertheless, it throws an error but end the execution successfully.

For option 0 (list;watch), option 3 (get;list), and option 4(list) the reconciliation was executed successfully but with a warning. How is it for option 5 (get;list;watch)? I believe it would finish without a warning

Yes, it finishes without any warning.

razo7 commented 1 year ago

Then, let's continue with option 5

jcanocan commented 1 year ago

v2:

razo7 commented 1 year ago

/ok-to-test

razo7 commented 1 year ago

Looks good, thanks again for submitting the PR /approve /lgtm

openshift-ci[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcanocan, razo7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/medik8s/fence-agents-remediation/blob/main/OWNERS)~~ [razo7] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment