medik8s / node-healthcheck-operator

K8s Node Health Check Operator
Apache License 2.0

Disable caching for namespaces #308

Closed · CecileRobertMichon closed 5 months ago

CecileRobertMichon commented 5 months ago

Why we need this PR

The node-healthcheck-manager-role ClusterRole is missing permissions to list and watch namespaces, causing the following errors in the manager logs from controller-runtime:

```
W0402 04:19:08.012105       1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:node-healthcheck:node-healthcheck-controller-manager" cannot list resource "namespaces" in API group "" at the cluster scope
E0402 04:19:08.012141       1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:node-healthcheck:node-healthcheck-controller-manager" cannot list resource "namespaces" in API group "" at the cluster scope
```

This PR disables caching for Namespaces, following the Operator SDK FAQ: https://sdk.operatorframework.io/docs/faqs/#after-deploying-my-operator-i-see-errors-like-failed-to-watch-external-type.

Changes made

Which issue(s) this PR fixes

ECOPROJECT-1940

Test plan

openshift-ci[bot] commented 5 months ago

Hi @CecileRobertMichon. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

slintes commented 5 months ago

Hi @CecileRobertMichon! Nice catch, we missed those log lines so far.

The reason for the error is that we `get` and `create` a common namespace for leases, which was added not too long ago. By default, controller-runtime adds every resource we `get` to its cache, and the cache then needs `list` and `watch` permissions.

We would prefer not to extend permissions in this case, though. Listing namespaces can reveal information about the cluster that we wouldn't have without that permission.

Instead we would prefer to disable caching for namespaces. That is possible with the ClientDisableCacheFor option when creating the manager in main.go, see https://sdk.operatorframework.io/docs/faqs/#after-deploying-my-operator-i-see-errors-like-failed-to-watch-external-type.
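
In main.go that would look roughly like this (untested sketch for the controller-runtime version we currently use, pre v0.15; `scheme` and error handling as in the existing code):

```go
// corev1 is "k8s.io/api/core/v1", client is "sigs.k8s.io/controller-runtime/pkg/client"
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme: scheme,
	// read namespaces directly from the API server instead of through the
	// cache, so the manager needs no list/watch permission for them
	ClientDisableCacheFor: []client.Object{&corev1.Namespace{}},
})
```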

Do you mind updating the PR to use that option? Otherwise we can add it to our backlog.

Thanks!

CecileRobertMichon commented 5 months ago

Hi @slintes, thanks for the reply and detailed guidance. I've updated the PR. Note that `ClientDisableCacheFor` no longer exists in controller-runtime; it was deprecated and later removed, so https://sdk.operatorframework.io/docs/faqs/#after-deploying-my-operator-i-see-errors-like-failed-to-watch-external-type is outdated.
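
With current controller-runtime (v0.15+) the replacement is the `Client.Cache.DisableFor` manager option. A minimal self-contained sketch (the real main.go wires in the project's scheme and the rest of the manager setup):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// ClientDisableCacheFor was replaced by client.CacheOptions.DisableFor:
	// the manager's client still reads namespaces, but bypasses the cache,
	// so no cluster-scoped list/watch on namespaces is required.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{&corev1.Namespace{}},
			},
		},
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // the real code registers controllers and calls mgr.Start
}
```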

CecileRobertMichon commented 5 months ago

@slintes while we're here, I'm also seeing:

```
E0402 04:46:21.482510 1 event.go:346] "Server rejected event (will not retry!)" err="events is forbidden: User \"system:serviceaccount:node-healthcheck:node-healthcheck-controller-manager\" cannot create resource \"events\" in API group \"\" in the namespace \"default\""
```

Happy to open a separate PR for that one if there is a desirable fix.

slintes commented 5 months ago

Hi @CecileRobertMichon. Thanks for the PR update!

We figured out why you run into the missing namespace permission while we never noticed it: for some reason, the Operator Lifecycle Manager (OLM), which is the only supported way to install the operator, adds a cluster role with that permission. So is it safe to assume you don't use OLM?

The missing event permission is related but different: that permission does exist in OLM's deployment configuration (the ClusterServiceVersion manifest, see https://github.com/medik8s/node-healthcheck-operator/blob/6e935de8cc8fd61e56706667499697858002f2d9/bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml#L569). It comes from https://github.com/medik8s/node-healthcheck-operator/blob/main/config/rbac/leader_election_role.yaml#L31. Somehow it is missing from your deployment.
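
For reference, the rule granting it is the standard kubebuilder leader election role; it looks roughly like this (see the linked file for the exact contents):

```yaml
# config/rbac/leader_election_role.yaml (namespaced Role):
# the leader election helpers emit events, so they need create/patch on them
- apiGroups:
    - ""
  resources:
    - events
  verbs:
    - create
    - patch
```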

/ok-to-test

openshift-ci[bot] commented 5 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/medik8s/node-healthcheck-operator/blob/main/OWNERS)~~ [slintes]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

CecileRobertMichon commented 5 months ago

@slintes correct, I'm attempting to install the operator without going through OLM to manage its lifecycle. So far it's worked well, aside from these few quirks.