openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

ETCD-674: Add E2E test for scaling when an unhealthy member is present #29203

Closed: jubittajohn closed this 1 week ago

jubittajohn commented 1 month ago

The following test covers a vertical scaling scenario in which a member is unhealthy. It validates that scale-down happens before scale-up when the deleted member is unhealthy. CPMS (ControlPlaneMachineSet) is disabled so that the scale-down-first behavior can be observed.

  1. If the CPMS is active, first disable it by deleting the CPMS custom resource.
  2. Remove the static pod manifest from a node and stop the kubelet on that node. This makes the member unhealthy.
  3. Delete the machine hosting the node from step 2.
  4. Verify that the member is removed and that the total voting member count is 2, which confirms that scale-down happens first when a member is unhealthy.
  5. Restore the initial cluster state by creating a new machine (scale-up) and re-enabling CPMS.
openshift-ci-robot commented 1 month ago

@jubittajohn: This pull request references ETCD-674 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/origin/pull/29203).

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Forigin). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jubittajohn

Once this PR has been reviewed and has the lgtm label, please assign hasbro17 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[test/extended/etcd/OWNERS](https://github.com/openshift/origin/blob/master/test/extended/etcd/OWNERS)**
- **[test/extended/util/annotate/generated/OWNERS](https://github.com/openshift/origin/blob/master/test/extended/util/annotate/generated/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.
jubittajohn commented 1 month ago

/test e2e-aws-ovn-etcd-scaling

jubittajohn commented 1 month ago

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn commented 1 month ago

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

openshift-trt-bot commented 1 month ago

Job Failure Risk Analysis for sha: 34733b45a78f25999f732b64199b6aa57b4a58c0

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout IncompleteTests
Tests for this run (101) are below the historical average (1155): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6 IncompleteTests
Tests for this run (101) are below the historical average (2074): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn IncompleteTests
Tests for this run (101) are below the historical average (2242): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
jubittajohn commented 1 month ago

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

openshift-trt-bot commented 1 month ago

Job Failure Risk Analysis for sha: 5c2b0d31440f177b19c309c71e83065e45ec921b

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout IncompleteTests
Tests for this run (101) are below the historical average (1064): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6 IncompleteTests
Tests for this run (101) are below the historical average (1813): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn IncompleteTests
Tests for this run (101) are below the historical average (2078): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 38.46% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.
jubittajohn commented 1 month ago

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

openshift-trt-bot commented 1 month ago

Job Failure Risk Analysis for sha: b5520eeed3d3f83ad07709d8d4b1277cf0871fa1

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-ovn-etcd-scaling High
[sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel]
This test has passed 100.00% of 7 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High
[sig-node] Managed cluster should verify that nodes have no unexpected reboots [Late] [Suite:openshift/conformance/parallel]
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.
jubittajohn commented 1 month ago

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

openshift-trt-bot commented 1 month ago

Job Failure Risk Analysis for sha: 5d2c0c7ab5eec2b9fbc4b10263ebffe9fe6a983b

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling High
[sig-api-machinery] disruption/cache-openshift-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
[sig-api-machinery] disruption/cache-oauth-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
[sig-api-machinery] disruption/kube-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High
[sig-architecture] platform pods in ns/openshift-etcd should not exit an excessive amount of times
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.

Open Bugs
etcd platform pod exist test failing on etcd-scaling jobs
---
[bz-etcd][invariant] alert/etcdMembersDown should not be at or above info
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.

Open Bugs
etcd-scaling jobs failing ~60% of the time
jubittajohn commented 1 month ago

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

openshift-trt-bot commented 1 month ago

Job Failure Risk Analysis for sha: da504c15843ba72b884d6f8fcbbc39370b57bf0c

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-vsphere-ovn-etcd-scaling High
[sig-api-machinery] disruption/cache-kube-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
[sig-api-machinery] disruption/cache-oauth-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
[sig-api-machinery] disruption/kube-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
[sig-api-machinery] disruption/oauth-api connection/new should be available throughout the test
This test has passed 100.00% of 2 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-etcd-scaling'] in the last 14 days.
---
Showing 4 of 7 test results
jubittajohn commented 3 weeks ago

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

jubittajohn commented 3 weeks ago

/test e2e-gcp-ovn-etcd-scaling

jubittajohn commented 3 weeks ago

/test e2e-aws-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

openshift-trt-bot commented 3 weeks ago

Job Failure Risk Analysis for sha: e2407560f825fd1db05c38fbda5b86dca4056f5e

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High
[bz-etcd][invariant] alert/etcdMembersDown should not be at or above info
This test has passed 100.00% of 4 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling'] in the last 14 days.
jubittajohn commented 1 week ago

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

openshift-trt-bot commented 1 week ago

Job Failure Risk Analysis for sha: c85351e18e0473868dca3489296ad9367b01b65a

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 69.23% of 13 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.
jubittajohn commented 1 week ago

/test e2e-aws-ovn-etcd-scaling
/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

openshift-ci[bot] commented 1 week ago

@jubittajohn: The following tests failed. Say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-ovn | 723212be94976f6923ffb2fed157689ef9b876c6 | link | true | `/test e2e-gcp-ovn` |
| ci/prow/e2e-openstack-ovn | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | `/test e2e-openstack-ovn` |
| ci/prow/e2e-gcp-ovn-etcd-scaling | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | `/test e2e-gcp-ovn-etcd-scaling` |
| ci/prow/e2e-aws-ovn-etcd-scaling | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | `/test e2e-aws-ovn-etcd-scaling` |
| ci/prow/e2e-aws-ovn-single-node-serial | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | `/test e2e-aws-ovn-single-node-serial` |
| ci/prow/e2e-azure-ovn-etcd-scaling | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | `/test e2e-azure-ovn-etcd-scaling` |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | `/test e2e-vsphere-ovn-etcd-scaling` |
| ci/prow/e2e-agnostic-ovn-cmd | 723212be94976f6923ffb2fed157689ef9b876c6 | link | false | `/test e2e-agnostic-ovn-cmd` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-trt-bot commented 1 week ago

Job Failure Risk Analysis for sha: 723212be94976f6923ffb2fed157689ef9b876c6

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling High
jubittajohn commented 1 week ago

Added this test to the PR: https://github.com/openshift/origin/pull/29236