openshift / router

Ingress controller for OpenShift
Apache License 2.0
68 stars 114 forks source link

OCPBUGS-26498: Optimize Upgrade Validation plugin by skipping unnecessary changes #587

Closed gcs278 closed 4 months ago

gcs278 commented 4 months ago

Both performIngressConditionUpdate and performIngressConditionRemoval functions add tasks to the writerlease queue even if no work needed to be done. This commit optimizes the Upgrade Validation plugin by ensuring that tasks for updating UnservableInFutureVersions are only added to the queue when changes are required.

This reduction in unnecessary work significantly lowers the frequency of lease extensions. Previously, excessive lease extensions could delay route status updates under certain conditions, such as during temporary contention periods where a router pod gets demoted to a follower. After the contention is resolved, the pod’s subsequent retry will be delayed more than necessary.

Additionally, this update provides a clearer indication of when updates are actually required, but have been completed by another router pod. The prior logic did not clearly distinguish between unnecessary updates and updates that were completed by another pod.

The selective updates are only added to the interface used by Upgrade Validation plugin to avoid perturbing existing Admitted condition logic. This means nominal clusters without SHA1 routes should not be impacted by the Upgrade Validation plugin, minimizing the risk associated with its introduction.

Fixes TestRouteAdmissionPolicy Flake by reducing the number of unnecessary lease extensions.

openshift-ci-robot commented 4 months ago

@gcs278: This pull request references Jira Issue OCPBUGS-26498, which is valid.

3 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.16.0) matches configured target version for branch (4.16.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @ShudiLi

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/router/pull/587): >Previously, both performIngressConditionUpdate and performIngressConditionRemoval functions were adding tasks to the writerlease queue even if no work needed to be done. This commit optimizes the logic by ensuring that tasks are only added to the queue when changes are required. > >This reduction in unnecessary work significantly lowers the frequency of lease extensions. Previously, excessive lease extensions could delay route status updates under certain conditions, such as during temporary contention periods >where a router pod gets demoted to a follower. After the contention is resolved, the pod’s subsequent retry will be delayed more than necessary. > >Additionally, this update provides a clearer indication of when updates are actually required, but have been completed by another router pod. The prior logic did not clearly distinguish between unnecessary updates and updates that were completed by another pod. > >This PR also add some logging in leasewriter and the plugin chain for assisting future debugging efforts. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Frouter). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
gcs278 commented 4 months ago

Though this is a valid improvement, I'm closing in favor of https://github.com/openshift/router/pull/588 as I now don't think this PR fixes the TestRouteAdmissionPolicy flake.

/close

openshift-ci[bot] commented 4 months ago

@gcs278: Closed this PR.

In response to [this](https://github.com/openshift/router/pull/587#issuecomment-2083816265): >Though this is a valid improvement, I'm closing in favor of https://github.com/openshift/router/pull/588 as I now don't think this PR fixes the `TestRouteAdmissionPolicy` flake. > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-ci-robot commented 4 months ago

@gcs278: This pull request references Jira Issue OCPBUGS-26498. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/router/pull/587): >Previously, both performIngressConditionUpdate and performIngressConditionRemoval functions were adding tasks to the writerlease queue even if no work needed to be done. This commit optimizes the logic by ensuring that tasks are only added to the queue when changes are required. > >This reduction in unnecessary work significantly lowers the frequency of lease extensions. Previously, excessive lease extensions could delay route status updates under certain conditions, such as during temporary contention periods >where a router pod gets demoted to a follower. After the contention is resolved, the pod’s subsequent retry will be delayed more than necessary. > >Additionally, this update provides a clearer indication of when updates are actually required, but have been completed by another router pod. The prior logic did not clearly distinguish between unnecessary updates and updates that were completed by another pod. > >Fixes [TestRouteAdmissionPolicy Flake](https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1049/pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator/1783251298133479424) by reducing the number of unnecessary lease extensions. > >This PR also add some logging in writerlease and the plugin chain for assisting future debugging efforts. > >WIP because it needs testing. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Frouter). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
gcs278 commented 4 months ago

TestRouteAdmissionPolicy is still flaking, and this may be a reasonable fix. /reopen

openshift-ci[bot] commented 4 months ago

@gcs278: Reopened this PR.

In response to [this](https://github.com/openshift/router/pull/587#issuecomment-2087964087): >`TestRouteAdmissionPolicy` is still flaking, and this may be a reasonable fix. >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-ci-robot commented 4 months ago

@gcs278: This pull request references Jira Issue OCPBUGS-26498, which is valid.

3 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.16.0) matches configured target version for branch (4.16.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @ShudiLi

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/router/pull/587): >Previously, both performIngressConditionUpdate and performIngressConditionRemoval functions were adding tasks to the writerlease queue even if no work needed to be done. This commit optimizes the logic by ensuring that tasks are only added to the queue when changes are required. > >This reduction in unnecessary work significantly lowers the frequency of lease extensions. Previously, excessive lease extensions could delay route status updates under certain conditions, such as during temporary contention periods >where a router pod gets demoted to a follower. After the contention is resolved, the pod’s subsequent retry will be delayed more than necessary. > >Additionally, this update provides a clearer indication of when updates are actually required, but have been completed by another router pod. The prior logic did not clearly distinguish between unnecessary updates and updates that were completed by another pod. > >Fixes [TestRouteAdmissionPolicy Flake](https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1049/pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator/1783251298133479424) by reducing the number of unnecessary lease extensions. > >This PR also add some logging in writerlease and the plugin chain for assisting future debugging efforts. > >WIP because it needs testing. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Frouter). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
gcs278 commented 4 months ago

Putting a hold to wait for testing confirmation. /hold

Miciah commented 4 months ago

Thanks! /approve /lgtm

openshift-ci[bot] commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/router/blob/master/OWNERS)~~ [Miciah] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
openshift-ci[bot] commented 4 months ago

@gcs278: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
gcs278 commented 4 months ago

I've ran nearly a hundred times with my e2e test in https://github.com/openshift/origin/pull/28710 and no issues. Reviewed the router logs and lease extends are drastically reduced. /hold cancel

openshift-ci-robot commented 4 months ago

@gcs278: Jira Issue OCPBUGS-26498: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-26498 has not been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/router/pull/587): >Both performIngressConditionUpdate and performIngressConditionRemoval functions add tasks to the writerlease queue even if no work needed to be done. This commit optimizes the Upgrade Validation plugin by ensuring that tasks for updating UnservableInFutureVersions are only added to the queue when changes are required. > >This reduction in unnecessary work significantly lowers the frequency of lease extensions. Previously, excessive lease extensions could delay route status updates under certain conditions, such as during temporary contention periods >where a router pod gets demoted to a follower. After the contention is resolved, the pod’s subsequent retry will be delayed more than necessary. > >Additionally, this update provides a clearer indication of when updates are actually required, but have been completed by another router pod. The prior logic did not clearly distinguish between unnecessary updates and updates that were completed by another pod. > >The selective updates are only added to the interface used by Upgrade Validation plugin to avoid perturbing existing Admitted condition logic. This means nominal clusters without SHA1 routes should not be impacted by the Upgrade >Validation plugin, minimizing the risk associated with its introduction. > >Fixes [TestRouteAdmissionPolicy Flake](https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1049/pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator/1783251298133479424) by reducing the number of unnecessary lease extensions. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Frouter). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-bot commented 4 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-haproxy-router-base-container-v4.16.0-202405020546.p0.g5610ac8.assembly.stream.el9 for distgit ose-haproxy-router-base. All builds following this will include this PR.