openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

OCPBUGS-36301: parallelize member health checks #1286

Closed AlexVulaj closed 4 months ago

AlexVulaj commented 4 months ago

Currently, member health is checked serially with a 30s timeout per member. Three of the four GetMemberHealth callers also set their own default 30s timeout for the entire process. Because of this, a slow check on one member could exhaust the timeout for the entire GetMemberHealth call, causing later-checked members to report as unhealthy even though they were fine.

With this commit, I am dropping the internal 30s timeout from GetMemberHealth and instead letting the caller set the timeout via its context. The code also now checks the health of all members in parallel, so a single slow member can no longer affect the health reporting of the other members.

I also added a timeout to the context used in IsMemberHealthy, which calls GetMemberHealth. Neither Trevor nor I was sure why a default timeout wasn't present there, though one was present at all other call sites.

openshift-ci-robot commented 4 months ago

@AlexVulaj: This pull request references Jira Issue OCPBUGS-36301, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1286):

> Currently, member health is checked in serial with a 30s timeout per member. 3 out of 4 GetMemberHealth callers had their own default 30s timeout as well for the entire process. Because of this, a slow check on one member could exhaust the timeout for the entire GetMemberHealth function, and thus cause later-checked members to report as unhealthy even though they were fine.
>
> With this commit, I am dropping the internal 30s timeout from GetMemberHealth, and instead letting the caller set the timeout. Also, the code now checks the health of all members in parallel. This will prevent a single slow member from affecting the health reporting of other members.
>
> I also added a timeout to the context used in IsMemberHealthy which calls GetMemberHealth. Neither Trevor nor I were sure why a default timeout wasn't present there, though one was present in all other call sites.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-etcd-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

AlexVulaj commented 4 months ago

/jira refresh

openshift-ci-robot commented 4 months ago

@AlexVulaj: This pull request references Jira Issue OCPBUGS-36301, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug:

* bug is open, matching expected state (open)
* bug target version (4.17.0) matches configured target version for branch (4.17.0)
* bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @geliu2016

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1286#issuecomment-2197419290):

> /jira refresh

tjungblu commented 4 months ago

/lgtm

tjungblu commented 4 months ago

/cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

openshift-cherrypick-robot commented 4 months ago

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1286#issuecomment-2200421634):

> /cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

openshift-ci[bot] commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlexVulaj, geliu2016, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/cluster-etcd-operator/blob/master/OWNERS)~~ [tjungblu]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

tjungblu commented 4 months ago

/retest-required

openshift-ci-robot commented 4 months ago

/retest-required

Remaining retests: 0 against base HEAD d82a13d2456cb89d6c64b508f80f7f6c36166c98 and 2 for PR HEAD 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b in total

tjungblu commented 4 months ago

/retest-required

openshift-ci-robot commented 4 months ago

/retest-required

Remaining retests: 0 against base HEAD 9d7b786489bd778c20b3b55f042645efdcf2bf24 and 1 for PR HEAD 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b in total

wking commented 4 months ago

I think the e2e-aws-ovn-etcd-scaling failures are unrelated to this change. They instead look like a combination of OCPBUGS-36462 (which I've just opened) and a need to refactor the `etcd is able to vertically scale up and down with a single node` test case to stop assuming the ControlPlaneMachineSet `status.readyReplicas` will hit 4, and instead check in some other way that the roll/recovery completed.

wking commented 4 months ago

> ...to stop assuming the ControlPlaneMachineSet status.readyReplicas will hit 4...

This is now tracked in ETCD-637. In the meantime, possibly worth an /override ci/prow/e2e-aws-ovn-etcd-scaling here? Or keep launching retests until we get lucky? Or wait for the CPMS and etcd work to green up the test?

tjungblu commented 4 months ago

/override ci/prow/e2e-aws-ovn-etcd-scaling

no doubt :)

openshift-ci[bot] commented 4 months ago

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-etcd-scaling

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1286#issuecomment-2205251659):

> /override ci/prow/e2e-aws-ovn-etcd-scaling
>
> no doubt :)

openshift-ci[bot] commented 4 months ago

@AlexVulaj: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-operator-fips | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-operator-fips |
| ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-metal-ovn-ha-cert-rotation-shutdown |
| ci/prow/e2e-aws-etcd-recovery | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-aws-etcd-recovery |
| ci/prow/e2e-aws-etcd-certrotation | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-aws-etcd-certrotation |
| ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-metal-ovn-sno-cert-rotation-shutdown |
| ci/prow/e2e-gcp-qe-no-capabilities | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-gcp-qe-no-capabilities |

Full PR test history. Your PR dashboard.

I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
tjungblu commented 4 months ago

unrelated failure

/override ci/prow/e2e-aws-ovn-serial

openshift-ci[bot] commented 4 months ago

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-serial

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1286#issuecomment-2205765739):

> unrelated failure
>
> /override ci/prow/e2e-aws-ovn-serial

openshift-ci-robot commented 4 months ago

@AlexVulaj: Jira Issue OCPBUGS-36301: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-36301 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1286).
openshift-cherrypick-robot commented 4 months ago

@tjungblu: new pull request created: #1290

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1286#issuecomment-2200421634):

> /cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12

openshift-bot commented 4 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-etcd-operator-container-v4.17.0-202407031527.p0.gaabb6d6.assembly.stream.el9 for distgit cluster-etcd-operator. All builds following this will include this PR.