openshift / cluster-nfd-operator

The Cluster Node Feature Discovery operator manages detection of hardware features and configuration in a Openshift cluster.
Apache License 2.0
34 stars 42 forks source link

Remove generation check in NFD update predicate #265

Closed mresvanis closed 2 years ago

mresvanis commented 2 years ago

This PR removes the update predicate that checks for the change of the .metadata.generation field of the NFD CR, in order to be able to fully reconcile a created NFD CR after its finalizer and status is updated.

Context

According to the Kubernetes documentation, .metadata or .status updates do not bump the .metadata.generation of an object. This leads to the NFD CR not being fully reconciled after its creation, i.e. after the addition of its finalizer and the update of its .status, there is no additional update reconciliation event queued. The cause is the update predicate used to check for the difference between the .metadata.generation field of the old and new NFD objects.

kubectl vs Kubernetes Go client

This issue cannot be reproduced using kubectl (not sure yet why that is), but it is reproducible by using a Kubernetes Go client (e.g. by creating a new NFD CR via another controller-runtime-based operator).

Resources

The .metadata.generation value is incremented for all changes, except for changes to .metadata or .status.

Signed-off-by: Michail Resvanis mresvani@redhat.com

ArangoGutierrez commented 2 years ago

Thanks for the PR @mresvanis , could you do so against https://github.com/kubernetes-sigs/node-feature-discovery-operator ?

ArangoGutierrez commented 2 years ago

/uncc dagrayvid

ArangoGutierrez commented 2 years ago

/assign

mresvanis commented 2 years ago

Thanks for the PR @mresvanis , could you do so against https://github.com/kubernetes-sigs/node-feature-discovery-operator ?

@ArangoGutierrez thank you for your comment, but if I'm not mistaken the upstream project currently doesn't have this issue. Specifically, it doesn't set up any predicates for the NFD CR events.

ArangoGutierrez commented 2 years ago

/test all

ArangoGutierrez commented 2 years ago

Thanks for this PR @mresvanis , can you test that this change doesn't add unwanted logs to the controller? @lenahorsley

yevgeny-shnaidman commented 2 years ago

@mresvanis, the reconciliation loop always returns Requeue=True upon status update when not all the components of the CR has been fully reconciled. Can you point what is the exact scenario that causes the reconciliation loop to stop? Maybe we can fix those specific cases in the reconciliation loop instead of in the predicate?

mresvanis commented 2 years ago

@yevgeny-shnaidman the case I bumped into, when Requeue=false, is when adding the finalizer (so it's a .metadata update that leads to the update event not being requeued).

As you suggested, I can change the line above to Requeue=true, instead of removing the predicate. Whatever makes more sense in your view.

IMHO having this predicate is indeed an optimization, which could lead to further issues down the road and I'm not so sure that it's worth it. But I don't have enough experience with the NFD operator to be able to support this opinion deeply.

@ArangoGutierrez

Thanks for this PR @mresvanis , can you test that this change doesn't add unwanted logs to the controller?

We are currently running a fork of the Cluster NFD operator with this PR included in production and as far as I can tell, we haven't seen more logs than the usual ones.

yevgeny-shnaidman commented 2 years ago

@yevgeny-shnaidman the case I bumped into, when Requeue=false, is when adding the finalizer (so it's a .metadata update that leads to the update event not being requeued).

As you suggested, I can change the line above to Requeue=true, instead of removing the predicate. Whatever makes more sense in your view. Please do, I think it is better to change the reconcile loop instead of the predicate. Just wondering why we have not seen such cases before...

IMHO having this predicate is indeed an optimization, which could lead to further issues down the road and I'm not so sure that it's worth it. But I don't have enough experience with the NFD operator to be able to support this opinion deeply.

@ArangoGutierrez

Thanks for this PR @mresvanis , can you test that this change doesn't add unwanted logs to the controller?

We are currently running a fork of the Cluster NFD operator with this PR included in production and as far as I can tell, we haven't seen more logs than the usual ones.

yevgeny-shnaidman commented 2 years ago

/lgtm

openshift-ci[bot] commented 2 years ago

@mresvanis: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
mresvanis commented 2 years ago

/hold

mresvanis commented 2 years ago

/unhold

lenahorsley commented 2 years ago

4.11 Build version: 4.11.0-0.nightly-2022-07-26-041421 NFD version: quay.io/lenahorsley/nfd-operator-bundle:4.11.20220726-pr265 Kube version: v1.24.0+9546431

4.12 Build version: 4.12.0-0.nightly-2022-07-26-082655 NFD version: quay.io/lenahorsley/nfd-operator-bundle:4.12.20220726-pr265 Kube version: v1.24.0+1ff0731

Tests performed (AWS 3 master/3 worker nodes):

  1. Build and deploy NFD bundle. Check the worker node labels.
  2. Add a worker node and check new node labels.
  3. Remove the newly added node and check the labels on the remaining nodes.
  4. Log creation and stability.
  5. Check for pods restarts in the openshift-etcd namespace (there were no restarts).

Note: for both bundle versions, NFD had ONE master node and THREE worker nodes

All tests were executed successfully.

openshift-ci[bot] commented 2 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ArangoGutierrez, mresvanis, yevgeny-shnaidman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/cluster-nfd-operator/blob/master/OWNERS)~~ [ArangoGutierrez] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment