openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

OCPBUGS-31492: Add a test that will fail on over 10k etcd took too long messages #28674

Closed dgoodwin closed 1 month ago

dgoodwin commented 1 month ago

We've found a subset of jobs in a specific environment showing extremely unhealthy etcd pod logs throughout the run, indicating disk IO issues for etcd.

Add a test to help identify these runs, and clearly communicate with engineers looking at the failures to help them notice that etcd is unhealthy and this can cause a multitude of other failures.

neisw commented 1 month ago

/lgtm

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, neisw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/origin/blob/master/OWNERS)~~ [dgoodwin,neisw]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

openshift-ci-robot commented 1 month ago

@dgoodwin: This pull request references Jira Issue OCPBUGS-31492, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/origin/pull/28674):

> We've found a subset of jobs in a specific environment showing extremely
> unhealthy etcd pod logs throughout the run, indicating disk IO issues
> for etcd.
>
> Add a test to help identify these runs, and clearly communicate with
> engineers looking at the failures to help them notice that etcd is
> unhealthy and this can cause a multitude of other failures.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Forigin). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

dgoodwin commented 1 month ago

/jira refresh

openshift-ci-robot commented 1 month ago

@dgoodwin: This pull request references Jira Issue OCPBUGS-31492, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

dgoodwin commented 1 month ago

/jira refresh

openshift-ci-robot commented 1 month ago

@dgoodwin: This pull request references Jira Issue OCPBUGS-31492, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

* bug is open, matching expected state (open)
* bug target version (4.16.0) matches configured target version for branch (4.16.0)
* bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

openshift-ci-robot commented 1 month ago

/retest-required

Remaining retests: 0 against base HEAD 8b3dee6ac31a5ba48f20bcefe96b10c8a2102b54 and 2 for PR HEAD 44eeffd6a4603cc2beb9467f76ce03eb4ffeb440 in total

neisw commented 1 month ago

/retest-required

openshift-ci[bot] commented 1 month ago

@dgoodwin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-agnostic-ovn-cmd | 44eeffd6a4603cc2beb9467f76ce03eb4ffeb440 | link | false | `/test e2e-agnostic-ovn-cmd` |
| ci/prow/e2e-aws-ovn-single-node | 44eeffd6a4603cc2beb9467f76ce03eb4ffeb440 | link | false | `/test e2e-aws-ovn-single-node` |
| ci/prow/e2e-aws-ovn-upgrade | 44eeffd6a4603cc2beb9467f76ce03eb4ffeb440 | link | false | `/test e2e-aws-ovn-upgrade` |
| ci/prow/e2e-metal-ipi-sdn | 44eeffd6a4603cc2beb9467f76ce03eb4ffeb440 | link | false | `/test e2e-metal-ipi-sdn` |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 44eeffd6a4603cc2beb9467f76ce03eb4ffeb440 | link | false | `/test e2e-aws-ovn-single-node-upgrade` |
| ci/prow/e2e-aws-ovn-single-node-serial | 44eeffd6a4603cc2beb9467f76ce03eb4ffeb440 | link | false | `/test e2e-aws-ovn-single-node-serial` |

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).

openshift-trt-bot commented 1 month ago

Job Failure Risk Analysis for sha: 44eeffd6a4603cc2beb9467f76ce03eb4ffeb440

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-master-e2e-aws-ovn-serial | Medium |

[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]

This test has passed 93.10% of 29 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-serial'] in the last 14 days.

Open Bugs

Jobs are failing on watch request limits for cluster-node-tuning-operator

neisw commented 1 month ago

/retest-required

openshift-ci-robot commented 1 month ago

@dgoodwin: Jira Issue OCPBUGS-31492: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-31492 has been moved to the MODIFIED state.

openshift-bot commented 1 month ago

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-enterprise-tests-container-v4.16.0-202404011414.p0.g56867df.assembly.stream.el8 for distgit openshift-enterprise-tests. All builds following this will include this PR.