openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

OCPBUGS-30873: CEO aliveness check should only detect deadlocks #1223

tjungblu closed 8 months ago

tjungblu commented 8 months ago

Currently, the aliveness check only detects whether a controller has been continuously running into errors, whereas we wanted to detect real deadlock situations. This change relaxes the aliveness check so that it only declares real locking situations as problematic.
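The difference between "failing" and "deadlocked" can be illustrated with a minimal Go sketch. This is not the operator's actual code: `syncTracker`, `recordSync`, `alive`, `errSyncFailed`, and the 10-minute `deadlockThreshold` are all hypothetical names and values chosen for illustration. The idea is that a sync call that returns, even with an error, proves the loop is still making progress, so only a long silence counts as a deadlock.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// deadlockThreshold is a hypothetical limit; the PR does not state one.
const deadlockThreshold = 10 * time.Minute

// errSyncFailed stands in for any error a controller sync might return.
var errSyncFailed = errors.New("sync failed")

// syncTracker records when a controller's sync call last returned.
type syncTracker struct {
	mu       sync.Mutex
	lastDone time.Time
}

// recordSync notes that a sync call returned at time now. The error is
// deliberately ignored: a sync that fails still shows the loop is not stuck.
func (s *syncTracker) recordSync(now time.Time, err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastDone = now
}

// alive reports false only when no sync call has returned within the
// threshold, which is the assumed signature of a deadlock.
func (s *syncTracker) alive(now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return now.Sub(s.lastDone) <= deadlockThreshold
}

func main() {
	start := time.Now()
	tracker := &syncTracker{lastDone: start}

	// The controller keeps failing, but each call still returns.
	tracker.recordSync(start.Add(9*time.Minute), errSyncFailed)
	fmt.Println(tracker.alive(start.Add(15 * time.Minute))) // last return 6 minutes ago

	// No sync call has returned for 21 minutes.
	fmt.Println(tracker.alive(start.Add(30 * time.Minute)))
}
```

Under these assumptions the first check reports the controller as alive despite the error, and the second reports it as not alive because nothing has returned within the threshold.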

Additionally, to avoid creating excessive log traffic, this change throttles the stack dumping to once every 15 minutes. Previously it would trigger on almost every health probe invocation and create multi-megabyte log spam.
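The throttling described above could look roughly like the following Go sketch. The names `stackDumper` and `maybeDump` are hypothetical and not taken from the operator's source; only the 15-minute interval comes from the description. The sketch returns the dump as a string instead of logging it, so that it stays self-contained.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// dumpInterval mirrors the 15-minute throttle mentioned in the description.
const dumpInterval = 15 * time.Minute

// stackDumper produces goroutine dumps at most once per dumpInterval.
type stackDumper struct {
	mu       sync.Mutex
	lastDump time.Time // zero value means no dump has happened yet
}

// maybeDump returns a dump of all goroutine stacks if the throttle window
// has elapsed since the previous dump, and an empty string otherwise.
func (d *stackDumper) maybeDump(now time.Time) string {
	d.mu.Lock()
	defer d.mu.Unlock()
	if !d.lastDump.IsZero() && now.Sub(d.lastDump) < dumpInterval {
		return "" // throttled: a dump was produced recently
	}
	d.lastDump = now
	buf := make([]byte, 1<<20)    // 1 MiB cap; larger dumps are truncated
	n := runtime.Stack(buf, true) // true: include all goroutines
	return string(buf[:n])
}

func main() {
	d := &stackDumper{}
	now := time.Now()
	fmt.Println(d.maybeDump(now) != "")                     // first failed probe
	fmt.Println(d.maybeDump(now.Add(time.Minute)) != "")    // one minute later
	fmt.Println(d.maybeDump(now.Add(16*time.Minute)) != "") // after the window
}
```

With this shape, a health probe that fails repeatedly would produce one dump on the first failure and then stay quiet until 15 minutes have passed, instead of writing a full dump on every invocation.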

openshift-ci-robot commented 8 months ago

@tjungblu: This pull request references Jira Issue OCPBUGS-30873, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug:
* bug is open, matching expected state (open)
* bug target version (4.16.0) matches configured target version for branch (4.16.0)
* bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @geliu2016

The bug has been updated to refer to the pull request using the external bug tracker.

dusk125 commented 8 months ago

/lgtm

openshift-ci[bot] commented 8 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, tjungblu


Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-etcd-operator/blob/master/OWNERS)~~ [dusk125,tjungblu]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.
tjungblu commented 8 months ago

/cherry-pick release-4.15

openshift-cherrypick-robot commented 8 months ago

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.

openshift-ci-robot commented 8 months ago

/retest-required

Remaining retests: 0 against base HEAD 463979a2bdc3e2d31ed4d94f2624ea1a2c39fb44 and 2 for PR HEAD 5e0db1bd00202e900b17a9e0f0859ad815e20f17 in total

tjungblu commented 8 months ago

unrelated failures

/override ci/prow/e2e-aws-ovn-serial
/override ci/prow/e2e-operator-fips

openshift-ci[bot] commented 8 months ago

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-serial, ci/prow/e2e-operator-fips

tjungblu commented 8 months ago

/override ci/prow/e2e-aws-ovn-single-node

openshift-ci[bot] commented 8 months ago

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-ovn-single-node

openshift-ci[bot] commented 8 months ago

@tjungblu: The following test failed. Say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-qe-no-capabilities | 5e0db1bd00202e900b17a9e0f0859ad815e20f17 | link | false | `/test e2e-gcp-qe-no-capabilities` |

openshift-ci-robot commented 8 months ago

@tjungblu: Jira Issue OCPBUGS-30873: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-30873 has been moved to the MODIFIED state.

openshift-cherrypick-robot commented 8 months ago

@tjungblu: new pull request created: #1225
