openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

Revert "OCPBUGS-38573: use pooled client for etcd single member health checks" #1322

Closed tjungblu closed 2 months ago

tjungblu commented 2 months ago

Reverts openshift/cluster-etcd-operator#1319

Not sure this is working as expected. I see many logs like this now:

[etcd-operator-78ddbd8998-r7f4t] W0821 08:54:24.113978 1 etcdcli_pool.go:87] cached client detected change in endpoints [[https://10.0.0.5:2379]] vs. [[https://10.0.0.3:2379 https://10.0.0.4:2379 https://10.0.0.5:2379]]

and subsequently the controllers picking up the client are no longer using the right endpoints:

[etcd-operator-78ddbd8998-r7f4t] E0821 09:05:44.532054 1 base_controller.go:268] DefragController reconciliation failed: failed to dial endpoint https://10.0.0.4:2379 with maintenance client: context deadline exceeded

/hold
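
The warning above reports a cached client whose endpoint list (one member) differs from the expected list (three members). As a rough illustration only, an order-insensitive comparison of that kind could look like the sketch below. The function name `endpointsChanged` is hypothetical and is not taken from `etcdcli_pool.go`; the actual check in the operator may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// endpointsChanged is a hypothetical helper (not the operator's code).
// It reports whether two endpoint lists differ, ignoring order.
func endpointsChanged(cached, current []string) bool {
	if len(cached) != len(current) {
		return true
	}
	// Sort copies so the callers' slices are left untouched.
	a := append([]string(nil), cached...)
	b := append([]string(nil), current...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return true
		}
	}
	return false
}

func main() {
	// Endpoint values copied from the warning log in this PR.
	single := []string{"https://10.0.0.5:2379"}
	all := []string{
		"https://10.0.0.3:2379",
		"https://10.0.0.4:2379",
		"https://10.0.0.5:2379",
	}
	fmt.Println(endpointsChanged(single, all)) // prints true
	fmt.Println(endpointsChanged(all, all))    // prints false
}
```

Under this reading, a client narrowed to a single member and then returned to a shared pool would trip the check every time another controller expects the full member list.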

openshift-ci-robot commented 2 months ago

@tjungblu: This pull request references Jira Issue OCPBUGS-38573, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-etcd-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

tjungblu commented 2 months ago

/hold cancel

tjungblu commented 2 months ago

/jira refresh

openshift-ci-robot commented 2 months ago

@tjungblu: This pull request references Jira Issue OCPBUGS-38573, which is valid.

3 validation(s) were run on this bug:
* bug is open, matching expected state (open)
* bug target version (4.18.0) matches configured target version for branch (4.18.0)
* bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @geliu2016

tjungblu commented 2 months ago

/retest

dusk125 commented 2 months ago

/lgtm

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-etcd-operator/blob/master/OWNERS)~~ [dusk125,tjungblu]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

tjungblu commented 2 months ago

/override ci/prow/e2e-agnostic-ovn
/override ci/prow/e2e-aws-ovn-serial

conformance test fails on cpu partitioning, unrelated to this PR

openshift-ci[bot] commented 2 months ago

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-agnostic-ovn, ci/prow/e2e-aws-ovn-serial

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

tjungblu commented 2 months ago

/override ci/prow/e2e-aws-ovn-single-node

openshift-ci[bot] commented 2 months ago

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-single-node

openshift-ci[bot] commented 2 months ago

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | f99495253337b76ac063a42092a1eb7f1885e2f9 | link | false | `/test e2e-metal-ovn-sno-cert-rotation-shutdown` |
| ci/prow/e2e-aws-etcd-certrotation | f99495253337b76ac063a42092a1eb7f1885e2f9 | link | false | `/test e2e-aws-etcd-certrotation` |
| ci/prow/e2e-aws-etcd-recovery | f99495253337b76ac063a42092a1eb7f1885e2f9 | link | false | `/test e2e-aws-etcd-recovery` |
| ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | f99495253337b76ac063a42092a1eb7f1885e2f9 | link | false | `/test e2e-metal-ovn-ha-cert-rotation-shutdown` |


openshift-ci-robot commented 2 months ago

@tjungblu: Jira Issue OCPBUGS-38573: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-38573 has been moved to the MODIFIED state.

openshift-bot commented 2 months ago

[ART PR BUILD NOTIFIER]

Distgit: cluster-etcd-operator

This PR has been included in build cluster-etcd-operator-container-v4.18.0-202408211243.p0.g164c37f.assembly.stream.el9. All builds following this will include this PR.