hexfusion opened 2 months ago
@hexfusion: This pull request references Jira Issue OCPBUGS-35199, which is valid. The bug has been moved to the POST state.
Requesting review from QA contact: /cc @sergiordlr
The bug has been updated to refer to the pull request using the external bug tracker.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: hexfusion. Once this PR has been reviewed and has the lgtm label, please assign djoshy for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
@hexfusion: This pull request references Jira Issue OCPBUGS-35199, which is valid.
Requesting review from QA contact: /cc @sergiordlr
@hexfusion: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
ci/prow/e2e-hypershift | 6ce6cd39fab11661c8200252d6fd4497d8d52975 | link | true | /test e2e-hypershift |
ci/prow/e2e-gcp-op-techpreview | 6ce6cd39fab11661c8200252d6fd4497d8d52975 | link | false | /test e2e-gcp-op-techpreview |
ci/prow/e2e-azure-ovn-upgrade-out-of-change | 6ce6cd39fab11661c8200252d6fd4497d8d52975 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
ci/prow/e2e-vsphere-ovn-upi-zones | 6ce6cd39fab11661c8200252d6fd4497d8d52975 | link | false | /test e2e-vsphere-ovn-upi-zones |
Full PR test history. Your PR dashboard.
We have run an upgrade in a disconnected cluster, with pinned images and an empty pull-secret, so there is no access to any registry.
Upgrade from 4.17.0-0.nightly-2024-06-06-061523 to 4.16.0-0.ci.test-2024-06-12-092356-ci-ln-frc8mnk-latest (ci image with our fix)
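For context, the pinned images in this scenario come from a PinnedImageSet (the MCD log below reconciles 99-worker-pinned-release). A minimal hedged sketch of such a resource; the image list and any pool targeting are assumptions, not the actual manifest used in this test:

# PinnedImageSet is a TechPreview v1alpha1 API; entries are pull specs by digest
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
  name: 99-worker-pinned-release
spec:
  pinnedImages:
  - name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<release-image-digest>  # hypothetical entry
  # ...one entry per image that must be available without registry access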
We have seen this error in the MCDs:
2024-06-13T10:11:04.640197053+00:00 stderr F I0613 10:11:04.640183 151780 rpm-ostree.go:316] Running captured: podman images -q --filter reference=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307
2024-06-13T10:11:10.825691692+00:00 stderr F I0613 10:11:10.825643 151780 pinned_image_set.go:426] Completed scheduling 25% of images
2024-06-13T10:11:20.840856559+00:00 stderr F I0613 10:11:20.840805 151780 pinned_image_set.go:426] Completed scheduling 50% of images
2024-06-13T10:11:30.856404386+00:00 stderr F I0613 10:11:30.856359 151780 pinned_image_set.go:426] Completed scheduling 75% of images
2024-06-13T10:11:40.872787010+00:00 stderr F I0613 10:11:40.872741 151780 pinned_image_set.go:426] Completed scheduling 100% of images
2024-06-13T10:11:42.981134742+00:00 stderr F I0613 10:11:42.981101 151780 pinned_image_set.go:527] CRI-O config file is up to date, no reload required
2024-06-13T10:12:03.746391202+00:00 stderr F I0613 10:12:03.746352 151780 certificate_writer.go:288] Certificate was synced from controllerconfig resourceVersion 114001
2024-06-13T10:12:04.720906853+00:00 stderr F time="2024-06-13T10:12:04Z" level=warning msg="Failed, retrying in 1s ... (1/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 54.86.200.156:443: i/o timeout"
2024-06-13T10:13:05.741726483+00:00 stderr F time="2024-06-13T10:13:05Z" level=warning msg="Failed, retrying in 2s ... (2/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 54.221.103.142:443: i/o timeout"
2024-06-13T10:14:03.185800816+00:00 stderr F I0613 10:14:03.185750 151780 daemon.go:1364] Shutting down MachineConfigDaemon
The image that is triggering the error is the original coreos image, not the target coreos image:
$ oc adm release info registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-06-061523 --pullspecs| grep coreos
rhel-coreos quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307
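For reference, the "Mirrors also failed" messages above imply a mirror configuration that maps the release payload repository to the disconnected registry. A hedged sketch of the corresponding ImageContentSourcePolicy (the cluster may use an ImageDigestMirrorSet instead; the resource name here is an assumption):

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: release-mirror  # hypothetical name
spec:
  repositoryDigestMirrors:
  - source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
    mirrors:
    - ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release

With something like this in place, a pull by digest tries the mirror first (which reports "authentication required" because the pull-secret is empty) and then falls back to quay.io (which is unreachable), matching both parts of the error above.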
After reporting this error for a long while, and even after restarting the MCDs, the configuration is eventually applied. I don't know why the MCD eventually stops restarting and decides to apply the configuration.
2024-06-13T10:19:44.284840446+00:00 stderr F I0613 10:19:44.284800 155375 pinned_image_set.go:527] CRI-O config file is up to date, no reload required
2024-06-13T10:20:05.045533819+00:00 stderr F I0613 10:20:05.045476 155375 certificate_writer.go:288] Certificate was synced from controllerconfig resourceVersion 114001
2024-06-13T10:20:06.014479760+00:00 stderr F time="2024-06-13T10:20:06Z" level=warning msg="Failed, retrying in 1s ... (1/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 44.194.103.74:443: i/o timeout"
2024-06-13T10:21:07.035257526+00:00 stderr F time="2024-06-13T10:21:07Z" level=warning msg="Failed, retrying in 2s ... (2/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 54.173.5.6:443: i/o timeout"
2024-06-13T10:21:55.224373042+00:00 stderr F I0613 10:21:55.224329 155375 pinned_image_set.go:302] Reconciling pinned image set: 99-worker-pinned-release: generation: 1
2024-06-13T10:21:55.328957322+00:00 stderr F I0613 10:21:55.328920 155375 pinned_image_set.go:527] CRI-O config file is up to date, no reload required
2024-06-13T10:22:09.059813690+00:00 stderr F W0613 10:22:09.059770 155375 daemon.go:2620] Unable to check manifest for matching hash: error parsing image name "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307": (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 54.86.200.156:443: i/o timeout
2024-06-13T10:22:09.059813690+00:00 stderr F I0613 10:22:09.059793 155375 rpm-ostree.go:316] Running captured: rpm-ostree kargs
2024-06-13T10:22:09.171178020+00:00 stderr F I0613 10:22:09.171136 155375 update.go:2621] Validated on-disk state
2024-06-13T10:22:09.222870849+00:00 stderr F I0613 10:22:09.222830 155375 update.go:2643] Adding SIGTERM protection
2024-06-13T10:22:09.242960693+00:00 stderr F I0613 10:22:09.242923 155375 update.go:1009] Checking Reconcilable for config rendered-worker-c11566a8572146defaf95ca346654742 to rendered-worker-64602b930cbae5db1feb30e67b974b39
2024-06-13T10:22:09.288117764+00:00 stderr F I0613 10:22:09.288074 155375 update.go:2621] Starting update from rendered-worker-c11566a8572146defaf95ca346654742 to rendered-worker-64602b930cbae5db1feb30e67b974b39: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
2024-06-13T10:22:09.322117482+00:00 stderr F I0613 10:22:09.322082 155375 update.go:757] Calculating node disruption actions
2024-06-13T10:22:09.322117482+00:00 stderr F I0613 10:22:09.322111 155375 drain.go:121] Checking drain required for node disruption actions
Eventually we get an upgrade that is reported to be successful:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.16.0-0.ci.test-2024-06-12-092356-ci-ln-frc8mnk-latest True False 90m Cluster version is 4.16.0-0.ci.test-2024-06-12-092356-ci-ln-frc8mnk-latest
Nevertheless, we can observe that the debug command does not work, because it tries to use the original image instead of the one corresponding to the new version in the cluster. If we look at the tools imagestream in the openshift namespace, we can see that the new image could not be imported:
$ oc get is -n openshift tools -oyaml
....
tags:
- conditions:
- generation: 8
lastTransitionTime: "2024-06-13T11:01:04Z"
message: 'Internal error occurred: [you may not have access to the container
image "ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocpupg/release@sha256:4ff6d13185bb1e378d8527ad6d3a3a92e11024488c1a74bffc42b0c9f8f21fd7",
registry.build03.ci.openshift.org/ci-ln-frc8mnk/stable@sha256:4ff6d13185bb1e378d8527ad6d3a3a92e11024488c1a74bffc42b0c9f8f21fd7:
Get "https://registry.build03.ci.openshift.org/v2/": dial tcp 54.172.72.33:443:
i/o timeout]'
reason: InternalError
status: "False"
type: ImportSuccess
items:
We can observe similar failures in these imagestreams in the openshift namespace (a quick way to check them all is sketched after the list):
name: cli
name: cli-artifacts
name: driver-toolkit
name: installer
name: installer-artifacts
name: must-gather
name: network-tools
name: oauth-proxy
name: tests
name: tools
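A hedged one-liner to check the ImportSuccess condition on all of the affected imagestreams (the field layout is assumed from the excerpt above):

for is in cli cli-artifacts driver-toolkit installer installer-artifacts must-gather network-tools oauth-proxy tests tools; do
  echo "== $is =="
  # print the condition block (message, reason, status) for each tag
  oc get is "$is" -n openshift -o yaml | grep -B5 'type: ImportSuccess'
done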
Extensions are working fine after the upgrade:
$ oc debug -q --image registry.build03.ci.openshift.org/ci-ln-frc8mnk/stable@sha256:4ff6d13185bb1e378d8527ad6d3a3a92e11024488c1a74bffc42b0c9f8f21fd7 node/ip-10-0-52-40 -- chroot /host rpm -q usbguard
usbguard-1.0.0-15.el9.x86_64
We hold the PR until we decide whether we need to fix the "reading manifest" error before merging it.
/hold
This PR is a follow-up to #4347 and #3821. It skips the imageInspect check if the PinnedImages feature gate is enabled and the osImage has been pulled locally.
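A minimal sketch of the gating described here, assuming hypothetical helper names (shouldSkipImageInspect, isImagePulledLocally); this is not the actual MCO code, only an illustration of the intent:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// isImagePulledLocally reports whether the image already exists in local
// container storage, so no registry access would be needed to inspect it.
// It reuses the same podman query the MCD logs show above.
func isImagePulledLocally(image string) (bool, error) {
	out, err := exec.Command("podman", "images", "-q", "--filter", "reference="+image).Output()
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(out)) != "", nil
}

// shouldSkipImageInspect mirrors the PR's intent: with the PinnedImages
// feature gate enabled and the osImage already pulled, skip the remote
// imageInspect that fails in a fully disconnected cluster.
func shouldSkipImageInspect(pinnedImagesEnabled bool, osImage string) (bool, error) {
	if !pinnedImagesEnabled {
		return false, nil
	}
	return isImagePulledLocally(osImage)
}

func main() {
	skip, err := shouldSkipImageInspect(true, "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307")
	if err != nil {
		fmt.Println("error checking local image:", err)
		return
	}
	fmt.Println("skip imageInspect:", skip)
}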