Closed hasbro17 closed 2 months ago
@hasbro17: This pull request references Jira Issue OCPBUGS-36621, which is valid. The bug has been moved to the POST state.
Requesting review from QA contact: /cc @geliu2016
The bug has been updated to refer to the pull request using the external bug tracker.
This bug has shown up rather elusively, in 1 of the last 20 consecutive runs on periodic-ci-openshift-release-master-ci-X.X-upgrade-from-stable-X.X-e2e-aws-ovn-upgrade, so let's do 20 aggregate runs to begin with.
/payload-aggregate periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade 20
@hasbro17: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/10d2cee0-453c-11ef-945b-bf6936f83d69-0
/hold
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dusk125, hasbro17
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/override ci/prow/e2e-aws-ovn-etcd-scaling
@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/e2e-aws-ovn-etcd-scaling
/label acknowledge-critical-fixes-only
Since this flake is causing some amount of CI noise, it would be good to get some runs during the soak time over the weekend to see if this makes a difference.
/test e2e-aws-etcd-certrotation
/retest-required
@hasbro17: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
ci/prow/e2e-aws-etcd-certrotation | b6532bd019dbc5ee7a9fa3eaa4dcd071a3ca4eac | link | false | /test e2e-aws-etcd-certrotation |
ci/prow/e2e-aws-etcd-recovery | b6532bd019dbc5ee7a9fa3eaa4dcd071a3ca4eac | link | false | /test e2e-aws-etcd-recovery |
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | b6532bd019dbc5ee7a9fa3eaa4dcd071a3ca4eac | link | false | /test e2e-metal-ovn-sno-cert-rotation-shutdown |
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | b6532bd019dbc5ee7a9fa3eaa4dcd071a3ca4eac | link | false | /test e2e-metal-ovn-ha-cert-rotation-shutdown |
ci/prow/e2e-operator-fips | b6532bd019dbc5ee7a9fa3eaa4dcd071a3ca4eac | link | false | /test e2e-operator-fips |
Full PR test history. Your PR dashboard.
The rotation test seems to pass fine, apart from the invariants complaining about members being down during the node restarts. The SNO test looks unrelated:
{ fail [k8s.io/kubernetes/test/e2e/architecture/conformance.go:45]: Conformance requires at least two nodes
Error: exit with code 1
Ginkgo exit error 1: exit with code 1}
/retest-required
Out of the 20 runs, 2 failed on cluster install issues and the rest succeeded. I don't think the install failures are related to this change. Going to try another 10 runs, and baremetal for good measure.
/payload 4.17 nightly blocking
/test e2e-metal-assisted
/test e2e-metal-ipi-ovn-ipv6
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ovn-assisted 10
@hasbro17: trigger 8 job(s) of type blocking for the nightly release of OCP 4.17
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d3880920-4576-11ef-96f2-1f46ff41935e-0
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d3880920-4576-11ef-96f2-1f46ff41935e-1
/payload-aggregate periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade 20
@hasbro17: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ee65fca0-45f5-11ef-8543-91775bb4cb1c-0
Okay, the other 20 runs also look good. All of them passed the upgrade without a degraded rollout.
All of them tripped up on [OLM][invariant] alert/KubePodNotReady should not be at or above info in ns/openshift-marketplace, which isn't related.
Can't see any "InstallerController reconciliation failed: missing required resources: [configmaps: etcd-all-bundles" message to indicate the rollout is degraded after the upgrade.
Going to merge this so we can get some runs over the weekend to see if this shows up again on periodic-ci-openshift-release-master-ci-X.X-upgrade-from-stable-X.X-e2e-aws-ovn-upgrade
/unhold
/retest-required
The SNO failure looks like it precedes our change (also that test doesn't make sense for SNO if we only have 1 node).
[sig-architecture] Conformance Tests should have at least two untainted nodes [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
That test has been failing across the board on SNO environments https://search.dptools.openshift.org/?search=Conformance+Tests+should+have+at+least+two+untainted+nodes&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
/override ci/prow/e2e-aws-ovn-single-node
@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/e2e-aws-ovn-single-node
/override ci/prow/e2e-metal-assisted
That has passed as well, just the aggregation step failed. https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregator-periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ovn-assisted/1814125750303854592
@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/e2e-metal-assisted
Don't actually need this one
/override ci/prow/e2e-metal-ipi-ovn-ipv6
@hasbro17: Overrode contexts on behalf of hasbro17: ci/prow/e2e-metal-ipi-ovn-ipv6
@hasbro17: Jira Issue OCPBUGS-36621: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-36621 has been moved to the MODIFIED state.
[ART PR BUILD NOTIFIER]
Distgit: cluster-etcd-operator
This PR has been included in build cluster-etcd-operator-container-v4.17.0-202407200311.p0.gc33725e.assembly.stream.el9. All builds following this will include this PR.
Previously, the etcdcertsigner controller would not run if a revision rollout was already in progress. That wait is necessary so that the CA is distributed before leaf certs are generated during a CA rotation. However, when no cert rotation is ongoing and the etcd-all-bundles configmap is missing, the revision rollout can get stuck, because the installer pod won't find the etcd-all-bundles configmap to install on disk.
In that case the etcdcertsigner controller would never generate the etcd-all-bundles configmap, as it waits for the revision rollout, which is in turn waiting on the configmap to be present.
This commit adds an override to let the controller sync and regenerate the configmap if it's missing, similar to how the controller runs during the bootstrap phase.
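To make the deadlock and the override easier to follow, here is a minimal Go sketch of the gating decision described above. It is not the operator's actual code: the `syncInputs` type, its fields, and the `shouldSync` function are hypothetical names for illustration, and the real controller reads this state from the cluster rather than from a struct.

```go
package main

import "fmt"

// syncInputs is a hypothetical snapshot of the state the cert signer
// controller would consult before deciding whether to reconcile.
type syncInputs struct {
	bootstrapInProgress bool // cluster is still bootstrapping
	rolloutInProgress   bool // a revision rollout has not finished yet
	bundlesConfigMapOK  bool // the etcd-all-bundles configmap exists
}

// shouldSync reports whether the controller should run now, with a reason.
// Assumption: the checks are ordered so that a missing configmap overrides
// the "wait for the rollout" rule, which is what breaks the deadlock.
func shouldSync(in syncInputs) (bool, string) {
	if in.bootstrapInProgress {
		return true, "bootstrap phase: always sync"
	}
	if !in.bundlesConfigMapOK {
		// Override: the rollout cannot finish without the configmap,
		// so waiting for the rollout here would never end.
		return true, "etcd-all-bundles missing: sync to regenerate it"
	}
	if in.rolloutInProgress {
		// Normal case: let the CA be distributed before new leaf certs.
		return false, "revision rollout in progress: wait"
	}
	return true, "steady state: sync"
}

func main() {
	// The stuck scenario: rollout in progress and the configmap is missing.
	ok, reason := shouldSync(syncInputs{rolloutInProgress: true, bundlesConfigMapOK: false})
	fmt.Println(ok, reason)
}
```

Without the second check, the stuck scenario in `main` would fall through to the rollout check and return false on every sync, which matches the behavior the commit message describes.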