AlexVulaj closed this pull request 4 months ago.
@AlexVulaj: This pull request references Jira Issue OCPBUGS-36301, which is invalid:
Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@AlexVulaj: This pull request references Jira Issue OCPBUGS-36301, which is valid. The bug has been moved to the POST state.
Requesting review from QA contact: /cc @geliu2016
/lgtm
/cherry-pick release-4.16 release-4.15 release-4.14 release-4.13 release-4.12
@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: AlexVulaj, geliu2016, tjungblu
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/retest-required
/retest-required
Remaining retests: 0 against base HEAD d82a13d2456cb89d6c64b508f80f7f6c36166c98 and 2 for PR HEAD 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b in total
/retest-required
/retest-required
Remaining retests: 0 against base HEAD 9d7b786489bd778c20b3b55f042645efdcf2bf24 and 1 for PR HEAD 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b in total
I think the `e2e-aws-ovn-etcd-scaling` failures are unrelated to this change, and are instead a combination of OCPBUGS-36462 (which I've just opened) and a need to refactor the `etcd is able to vertically scale up and down with a single node` test case to stop assuming the ControlPlaneMachineSet `status.readyReplicas` will hit 4, and instead do something else to check that the roll/recovery completed.
> ...to stop assuming the ControlPlaneMachineSet `status.readyReplicas` will hit 4...
This is now tracked in ETCD-637. In the meantime, possibly worth an /override ci/prow/e2e-aws-ovn-etcd-scaling here? Or keep launching retests until we get lucky? Or wait for the CPMS and etcd work to green up the test?
/override ci/prow/e2e-aws-ovn-etcd-scaling
no doubt :)
@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-etcd-scaling
@AlexVulaj: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
ci/prow/e2e-operator-fips | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-operator-fips |
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-metal-ovn-ha-cert-rotation-shutdown |
ci/prow/e2e-aws-etcd-recovery | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-aws-etcd-recovery |
ci/prow/e2e-aws-etcd-certrotation | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-aws-etcd-certrotation |
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-metal-ovn-sno-cert-rotation-shutdown |
ci/prow/e2e-gcp-qe-no-capabilities | 6558a88475c7ef3f49db87b3c3ce7d6c6a4fa42b | link | false | /test e2e-gcp-qe-no-capabilities |
Full PR test history. Your PR dashboard.
unrelated failure
/override ci/prow/e2e-aws-ovn-serial
@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-serial
@AlexVulaj: Jira Issue OCPBUGS-36301: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-36301 has been moved to the MODIFIED state.
@tjungblu: new pull request created: #1290
[ART PR BUILD NOTIFIER]
This PR has been included in build cluster-etcd-operator-container-v4.17.0-202407031527.p0.gaabb6d6.assembly.stream.el9 for distgit cluster-etcd-operator. All builds following this will include this PR.
Currently, member health is checked serially with a 30s timeout per member. Three of the four GetMemberHealth callers also applied their own default 30s timeout to the entire process. Because of this, a slow check on one member could exhaust the timeout for the whole GetMemberHealth call and cause later-checked members to be reported as unhealthy even though they were fine.
With this commit, I am dropping the internal 30s timeout from GetMemberHealth, and instead letting the caller set the timeout. Also, the code now checks the health of all members in parallel. This will prevent a single slow member from affecting the health reporting of other members.
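The minimal Go sketch below illustrates the shape of that change under assumed types. `Member`, `memberHealth`, and the injected `check` function are placeholders for this illustration, not the operator's actual API; the point is that the caller's context carries the only deadline and all members are probed concurrently.

```go
package health

import (
	"context"
	"sync"
)

// Member and memberHealth are simplified placeholders for this sketch;
// the operator's real types carry more fields.
type Member struct {
	Name      string
	ClientURL string
}

type memberHealth struct {
	Member  Member
	Healthy bool
	Err     error
}

// GetMemberHealth probes every member concurrently. It deliberately has no
// internal timeout: the caller's ctx carries the only deadline, so a slow
// member only spends its own goroutine's time instead of eating into the
// budget of members checked after it.
func GetMemberHealth(ctx context.Context, members []Member,
	check func(ctx context.Context, m Member) error) []memberHealth {

	results := make([]memberHealth, len(members))

	var wg sync.WaitGroup
	for i, m := range members {
		wg.Add(1)
		go func(i int, m Member) {
			defer wg.Done()
			err := check(ctx, m) // each probe still honors ctx's deadline
			results[i] = memberHealth{Member: m, Healthy: err == nil, Err: err}
		}(i, m)
	}
	wg.Wait()

	return results
}
```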
I also added a timeout to the context used in IsMemberHealthy, which calls GetMemberHealth. Neither Trevor nor I were sure why a default timeout wasn't present there, though one was present at all the other call sites.
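A correspondingly minimal sketch of the caller-side change, reusing the placeholder types from the previous sketch: the caller (here a simplified IsMemberHealthy wrapper, not the operator's actual function) creates the timeout itself and hands it down through the context.

```go
package health

import (
	"context"
	"time"
)

// IsMemberHealthy now owns its timeout and passes it down via the context
// instead of relying on GetMemberHealth to enforce one internally.
func IsMemberHealthy(ctx context.Context, member Member,
	check func(ctx context.Context, m Member) error) bool {

	// 30s mirrors the default the other call sites already use; the value
	// chosen in the actual PR may differ.
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	for _, h := range GetMemberHealth(ctx, []Member{member}, check) {
		if !h.Healthy {
			return false
		}
	}
	return true
}
```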