openshift-kni / lifecycle-agent

Local agent for orchestration of SNO Image Based Upgrade
Apache License 2.0
6 stars 26 forks source link

OCPBUGS-32493: make all client calls retriable with middleware #548

Closed pixelsoccupied closed 4 weeks ago

pixelsoccupied commented 1 month ago

Background / Context

This builds on https://github.com/openshift-kni/lifecycle-agent/pull/541. We now have middleware that will do the retries, making all the client call implicitly retriable.

As long as clients are configured to include it during initialization...the middleware will intercept all calls and retry exponentially for max 3 mins if any of the error conditions are met.

/cc @donpenney

openshift-ci-robot commented 1 month ago

@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32493, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @yliu127

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift-kni/lifecycle-agent/pull/548): ># Background / Context > >This builds on https://github.com/openshift-kni/lifecycle-agent/pull/541. All the remaining clients calls during Prep (stateroot setup job) and Idle (all the cleanups that need API server) are wrapped to retry calls before finally failing. > >Backoff is configured and tested to withstand 5 mins outage + any the time needed for the API server to recover...but should be able to retry much longer. A logger `RetryClient` will print the error as an info and retry if the conditions are met such as `connection refused` > >/cc @donpenney Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift-kni%2Flifecycle-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
pixelsoccupied commented 1 month ago

/test ibu-e2e-flow

openshift-ci-robot commented 1 month ago

@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32493, which is valid.

3 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @yliu127

In response to [this](https://github.com/openshift-kni/lifecycle-agent/pull/548): ># Background / Context > >This builds on https://github.com/openshift-kni/lifecycle-agent/pull/541. All the remaining clients calls during Prep (stateroot setup job) and Idle (all the cleanups that need API server) are wrapped to retry calls before finally failing. > >Backoff is configured and tested to withstand 5 mins outage + any the time needed for the API server to recover...but should be able to retry much longer. A logger `CallWithRetry` will print the error as an info and retry if the conditions are met such as `connection refused` > >/cc @donpenney Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift-kni%2Flifecycle-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
pixelsoccupied commented 1 month ago

/hold will update the existing retry func instead

openshift-ci-robot commented 4 weeks ago

@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32493, which is valid.

3 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @yliu127

In response to [this](https://github.com/openshift-kni/lifecycle-agent/pull/548): ># Background / Context > >This builds on https://github.com/openshift-kni/lifecycle-agent/pull/541. We now have middleware that will do the retries, making all the client call implicitly retriable. > >As long as clients are configured to include it during initialization...the middleware will intercept all calls and retry exponentially for max 3 mins if any of the error conditions are met. > >/cc @donpenney Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift-kni%2Flifecycle-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
pixelsoccupied commented 4 weeks ago

/hold testing the changes e2e

pixelsoccupied commented 4 weeks ago

/unhold

Will remove all the existing wrapped calls in a separate PR (essentially those are now redundant) but everything is working well! Just a FYI the post-pivot dynamicclient middleware update will be picked up once a new seed is generated (but of course upgrade will go through regardless)

Missxiaoguo commented 4 weeks ago

I also tried this on my env. It worked well! Thanks for your great work! @pixelsoccupied /lgtm

pixelsoccupied commented 4 weeks ago

/test ibi-e2e-flow

openshift-ci[bot] commented 4 weeks ago

@pixelsoccupied: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ibi-e2e-flow 2c1e6ca114081eb8b2b187e713253b129825d610 link false /test ibi-e2e-flow

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
browsell commented 4 weeks ago

/approve

openshift-ci[bot] commented 4 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: browsell

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift-kni/lifecycle-agent/blob/main/OWNERS)~~ [browsell] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
openshift-ci-robot commented 4 weeks ago

@pixelsoccupied: Jira Issue OCPBUGS-32493: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-32493 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift-kni/lifecycle-agent/pull/548): ># Background / Context > >This builds on https://github.com/openshift-kni/lifecycle-agent/pull/541. We now have middleware that will do the retries, making all the client call implicitly retriable. > >As long as clients are configured to include it during initialization...the middleware will intercept all calls and retry exponentially for max 3 mins if any of the error conditions are met. > >/cc @donpenney Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift-kni%2Flifecycle-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.