openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

OCPBUGS-36621: Force sync on missing etcd-all-bundles configmap #1296

Closed hasbro17 closed 2 months ago

hasbro17 commented 2 months ago

Previously, the etcdcertsigner controller would not run while a revision rollout was in progress. That restriction is necessary during a CA rotation, where the CA must be distributed before leaf certs are generated. However, when no cert rotation is ongoing and the etcd-all-bundles configmap is missing, the revision rollout can get stuck because the installer pod won't find the etcd-all-bundles configmap to install on disk.

In that case the etcdcertsigner controller would never generate the etcd-all-bundles configmap, because it waits for the revision rollout, which is in turn waiting for the configmap to be present.

This commit adds an override that lets the controller sync and regenerate the configmap if it's missing, similar to how the controller runs during the bootstrap phase.
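The decision described above can be sketched roughly as follows. This is only an illustration of the gating logic, not the code from this PR: `syncInputs`, `shouldSync`, and the field names are hypothetical, and the real controller reads this state from the cluster rather than from a struct.

```go
package main

// syncInputs collects the facts the decision depends on.
// Hypothetical type for illustration; not the operator's real API.
type syncInputs struct {
	bootstrapComplete bool // false while the cluster is still bootstrapping
	rolloutInProgress bool // a revision rollout is currently underway
	allBundlesPresent bool // the etcd-all-bundles configmap exists
}

// shouldSync reports whether the cert signer controller should run its sync.
func shouldSync(in syncInputs) bool {
	// During bootstrap the controller runs regardless of rollout state.
	if !in.bootstrapComplete {
		return true
	}
	// Override: a missing configmap forces a sync, because the rollout
	// cannot finish until the configmap exists.
	if !in.allBundlesPresent {
		return true
	}
	// Otherwise wait for any in-progress rollout to finish, so that a CA
	// is distributed before leaf certs are regenerated.
	return !in.rolloutInProgress
}
```

Without the second check, the combination of a rollout in progress and a missing configmap would return false on every sync, which is the deadlock described above.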

openshift-ci-robot commented 2 months ago

@hasbro17: This pull request references Jira Issue OCPBUGS-36621, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug:
* bug is open, matching expected state (open)
* bug target version (4.17.0) matches configured target version for branch (4.17.0)
* bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @geliu2016

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-etcd-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

hasbro17 commented 2 months ago

This bug has shown up rather elusively, in 1 of the last 20 consecutive runs on periodic-ci-openshift-release-master-ci-X.X-upgrade-from-stable-X.X-e2e-aws-ovn-upgrade. So let's do 20 aggregate runs to begin with.

/payload-aggregate periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade 20

openshift-ci[bot] commented 2 months ago

@hasbro17: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/10d2cee0-453c-11ef-945b-bf6936f83d69-0

hasbro17 commented 2 months ago

/hold

dusk125 commented 2 months ago

/lgtm

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, hasbro17

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-etcd-operator/blob/master/OWNERS)~~ [dusk125,hasbro17]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

hasbro17 commented 2 months ago

/override ci/prow/e2e-aws-ovn-etcd-scaling

openshift-ci[bot] commented 2 months ago

@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/e2e-aws-ovn-etcd-scaling

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

hasbro17 commented 2 months ago

/label acknowledge-critical-fixes-only

Since this flake is causing some amount of CI noise, it would be good to get some runs in during the soak time over the weekend to see if this makes a difference.

/test e2e-aws-etcd-certrotation
/retest-required

openshift-ci[bot] commented 2 months ago

@hasbro17: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-etcd-certrotation | b6532bd019dbc5ee7a9fa3eaa4dcd071a3ca4eac | link | false | `/test e2e-aws-etcd-certrotation` |
| ci/prow/e2e-aws-etcd-recovery | b6532bd019dbc5ee7a9fa3eaa4dcd071a3ca4eac | link | false | `/test e2e-aws-etcd-recovery` |
| ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | b6532bd019dbc5ee7a9fa3eaa4dcd071a3ca4eac | link | false | `/test e2e-metal-ovn-sno-cert-rotation-shutdown` |
| ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | b6532bd019dbc5ee7a9fa3eaa4dcd071a3ca4eac | link | false | `/test e2e-metal-ovn-ha-cert-rotation-shutdown` |
| ci/prow/e2e-operator-fips | b6532bd019dbc5ee7a9fa3eaa4dcd071a3ca4eac | link | false | `/test e2e-operator-fips` |

hasbro17 commented 2 months ago

The rotation test seems to pass fine, apart from the invariants complaining about members being down during the node restarts. The SNO test failure looks unrelated:

{  fail [k8s.io/kubernetes/test/e2e/architecture/conformance.go:45]: Conformance requires at least two nodes
Error: exit with code 1
Ginkgo exit error 1: exit with code 1}

/retest-required

hasbro17 commented 2 months ago

Out of the 20 runs, 2 failed on cluster install issues and the rest succeeded. I don't think those failures are related to this change. Going to try another 10 runs, and baremetal for good measure.

/payload 4.17 nightly blocking
/test e2e-metal-assisted
/test e2e-metal-ipi-ovn-ipv6
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ovn-assisted 10

openshift-ci[bot] commented 2 months ago

@hasbro17: trigger 8 job(s) of type blocking for the nightly release of OCP 4.17

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d3880920-4576-11ef-96f2-1f46ff41935e-0

trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d3880920-4576-11ef-96f2-1f46ff41935e-1

hasbro17 commented 2 months ago

/payload-aggregate periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade 20

openshift-ci[bot] commented 2 months ago

@hasbro17: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ee65fca0-45f5-11ef-8543-91775bb4cb1c-0

hasbro17 commented 2 months ago

Okay, the other 20 runs also look good. All of them passed the upgrade without a degraded rollout. All of them tripped up on `[OLM][invariant] alert/KubePodNotReady should not be at or above info in ns/openshift-marketplace`, which isn't related.

I can't see any `InstallerController reconciliation failed: missing required resources: [configmaps: etcd-all-bundles` message that would indicate the rollout is degraded after the upgrade.

Going to merge this so we can get some runs over the weekend to see if this shows up again on periodic-ci-openshift-release-master-ci-X.X-upgrade-from-stable-X.X-e2e-aws-ovn-upgrade

/unhold

hasbro17 commented 2 months ago

/retest-required

hasbro17 commented 2 months ago

The SNO failure looks like it precedes our change (also that test doesn't make sense for SNO if we only have 1 node).
`[sig-architecture] Conformance Tests should have at least two untainted nodes [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]`

That test has been failing across the board on SNO environments:
https://search.dptools.openshift.org/?search=Conformance+Tests+should+have+at+least+two+untainted+nodes&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

https://prow.ci.openshift.org/job-history/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node

/override ci/prow/e2e-aws-ovn-single-node

openshift-ci[bot] commented 2 months ago

@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/e2e-aws-ovn-single-node

hasbro17 commented 2 months ago

/override ci/prow/e2e-metal-assisted
That has passed as well, just the aggregation step failed.
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregator-periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ovn-assisted/1814125750303854592

openshift-ci[bot] commented 2 months ago

@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/e2e-metal-assisted

hasbro17 commented 2 months ago

Don't actually need this one
/override ci/prow/e2e-metal-ipi-ovn-ipv6

openshift-ci[bot] commented 2 months ago

@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/e2e-metal-ipi-ovn-ipv6

openshift-ci-robot commented 2 months ago

@hasbro17: Jira Issue OCPBUGS-36621: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-36621 has been moved to the MODIFIED state.

openshift-bot commented 2 months ago

[ART PR BUILD NOTIFIER]

Distgit: cluster-etcd-operator
This PR has been included in build cluster-etcd-operator-container-v4.17.0-202407200311.p0.gc33725e.assembly.stream.el9. All builds following this will include this PR.