openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

WIP: ETCD-612: Introduced a dedicated quorumz handler to ensure PDB is violated if quorum is not safe #1278

Closed jubittajohn closed 2 months ago

jubittajohn commented 5 months ago

Instead of checking for quorum in every controller that could initiate a revision rollout, this PR leverages the PodDisruptionBudget (PDB) to block the static pod rollout.

A dedicated `quorumz` handler is introduced in the existing `readyz` container; it checks for quorum in a way similar to the existing `CheckSafeToScaleCluster` functionality. When quorum is not safe, the handler marks all etcd guard pods as NOT_READY, which ensures the PDB is violated and blocks additional pod scheduling.

The existing quorum checks in the individual controllers are removed.
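To make the idea concrete, here is a minimal sketch of what such a readiness-style endpoint could look like. It is not the code from this PR: the `memberLister` interface, the `staticMembers` stub, the `probe` helper, and the `/quorumz` path are all hypothetical, and "safe" is assumed to mean that the cluster could lose one more healthy member and still keep a majority.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// memberLister is a hypothetical abstraction over an etcd client. It reports
// the number of voting members and how many of them are currently healthy.
type memberLister interface {
	Members() (total, healthy int, err error)
}

// quorumSafe reports whether the cluster would still have a majority after
// losing one more healthy member. This definition of "safe" is an assumption.
// Note that a single-member cluster is never safe under this definition.
func quorumSafe(total, healthy int) bool {
	quorum := total/2 + 1
	return healthy-1 >= quorum
}

// quorumzHandler answers 200 when quorum is safe and 503 otherwise, so a
// readiness probe pointed at it flips the pod to not ready.
func quorumzHandler(l memberLister) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		total, healthy, err := l.Members()
		if err != nil {
			http.Error(w, fmt.Sprintf("cannot determine member health: %v", err), http.StatusServiceUnavailable)
			return
		}
		if !quorumSafe(total, healthy) {
			http.Error(w, fmt.Sprintf("quorum not safe: %d of %d members healthy", healthy, total), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// staticMembers is a stub memberLister used only for illustration.
type staticMembers struct{ total, healthy int }

func (s staticMembers) Members() (int, int, error) { return s.total, s.healthy, nil }

// probe invokes the handler in-process and returns the HTTP status code,
// roughly what a readiness probe would observe.
func probe(l memberLister) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/quorumz", nil)
	quorumzHandler(l)(rec, req)
	return rec.Code
}
```

In a real server the handler would be registered next to the existing readiness endpoint, for example with `http.Handle("/quorumz", quorumzHandler(client))`, where `client` is whatever implements the member lookup.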

openshift-ci-robot commented 5 months ago

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1278):

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-etcd-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 5 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jubittajohn

Once this PR has been reviewed and has the lgtm label, please assign tjungblu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/openshift/cluster-etcd-operator/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.

Approvers can cancel approval by writing `/approve cancel` in a comment.

openshift-ci-robot commented 5 months ago

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1278):

> This PR uses openshift/library-go#1749

openshift-ci-robot commented 5 months ago

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1278):

> This PR is blocked by openshift/library-go#1749

lance5890 commented 5 months ago

+1

tjungblu commented 4 months ago

@jubittajohn once the CI here is mostly green, you can run the payload jobs using `/payload`.

For example, `/payload 4.17 nightly blocking` will run, with this PR, all 4.17 nightly jobs that must pass to generate a release (the "blocking" jobs). This tests several clouds and form factors of OpenShift.

You can check those out here as an example for the usual nightly runs: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.17.0-0.nightly/release/4.17.0-0.nightly-2024-06-20-005211

Be aware that those tests are somewhat expensive to run, so use them sparingly: run them when you feel you're close to being done and just want additional assurance that you aren't somehow breaking OpenShift as a whole.

dusk125 commented 4 months ago

/retest-required

dusk125 commented 4 months ago

/retest-required

dusk125 commented 4 months ago

/retest-required

openshift-ci-robot commented 3 months ago

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1278):

> Instead of checking for quorum in all controllers that could initiate a revision rollout, the functionality of PDB is leveraged to block the static pod rollout.
>
> A dedicated `quorumz` handler is introduced in the existing `readyz` container which checks for quorum similar to the existing `CheckSafeToScaleCluster` functionality. This marks all etcd guard pods as NOT_READY when quorum is not safe, ensuring PDB is violated and blocking the additional pod scheduling.

openshift-ci-robot commented 3 months ago

@jubittajohn: This pull request references ETCD-612 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1278):

> Instead of checking for quorum in all controllers that could initiate a revision rollout, the functionality of PDB is leveraged to block the static pod rollout.
>
> A dedicated `quorumz` handler is introduced in the existing `readyz` container which checks for quorum similar to the existing `CheckSafeToScaleCluster` functionality. This marks all etcd guard pods as NOT_READY when quorum is not safe, ensuring PDB is violated and blocking the additional pod scheduling.
>
> Removed the existing quorum checks in the different controllers.

tjungblu commented 3 months ago

/retest

openshift-ci[bot] commented 2 months ago

@jubittajohn: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-qe-no-capabilities | 2be2710a623444ee2a412544b9cac27417ce5815 | link | false | `/test e2e-gcp-qe-no-capabilities` |
| ci/prow/e2e-aws-etcd-certrotation | 0ec8504fe821e477e5808b67e6ae7c76a1ef5764 | link | false | `/test e2e-aws-etcd-certrotation` |
| ci/prow/e2e-aws-etcd-recovery | 0ec8504fe821e477e5808b67e6ae7c76a1ef5764 | link | false | `/test e2e-aws-etcd-recovery` |
| ci/prow/e2e-aws-ovn-serial | 0ec8504fe821e477e5808b67e6ae7c76a1ef5764 | link | true | `/test e2e-aws-ovn-serial` |
| ci/prow/e2e-aws-ovn-etcd-scaling | 0ec8504fe821e477e5808b67e6ae7c76a1ef5764 | link | true | `/test e2e-aws-ovn-etcd-scaling` |
| ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | 0ec8504fe821e477e5808b67e6ae7c76a1ef5764 | link | false | `/test e2e-metal-ovn-ha-cert-rotation-shutdown` |
| ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | 0ec8504fe821e477e5808b67e6ae7c76a1ef5764 | link | false | `/test e2e-metal-ovn-sno-cert-rotation-shutdown` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
jubittajohn commented 2 months ago

When performing rollouts and applying new manifests, the kubelet doesn't go through the eviction API, so the PDB is never consulted in this case. Because of this limitation, we can't fully rely on the guard pods to block the static pod rollout.
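The limitation can be illustrated with a toy model. This is a simplified sketch, not Kubernetes or operator code: the `pdb` type, the `minAvailable` value, and both functions are hypothetical. It only encodes the point above, namely that a PDB gates requests made through the eviction API, while a restart triggered by a changed manifest never asks.

```go
package main

// pdb is a simplified, hypothetical model of a PodDisruptionBudget that
// uses a minAvailable threshold.
type pdb struct {
	minAvailable int
}

// disruptionsAllowed mirrors the usual PDB arithmetic: ready pods above the
// minimum may be disrupted, and the result never goes below zero.
func (p pdb) disruptionsAllowed(readyPods int) int {
	if d := readyPods - p.minAvailable; d > 0 {
		return d
	}
	return 0
}

// evict models a request made through the eviction API: it succeeds only
// while the budget still allows a disruption.
func evict(p pdb, readyPods int) bool {
	return p.disruptionsAllowed(readyPods) > 0
}

// manifestRestart models the kubelet restarting a pod because its manifest
// changed. The budget is deliberately ignored, which is the limitation
// described in the comment above.
func manifestRestart(p pdb, readyPods int) bool {
	return true
}
```

With a budget of `pdb{minAvailable: 2}` and zero ready guard pods, `evict` returns false while `manifestRestart` still returns true, which is why marking guard pods NOT_READY does not stop the rollout on its own.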