openshift / machine-config-operator

OCPBUGS-35199: daemon: skip imageInspect during checkOS for PinnedImages #4402

Open hexfusion opened 2 months ago

hexfusion commented 2 months ago

This PR is a follow-up to #4347 and #3821. It skips the imageInspect check when the PinnedImages feature gate is enabled and the osImage has already been pulled locally.
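
For reviewers who want the intended decision in one place, here is a minimal sketch of the check described above. It is not the code in this PR: `shouldSkipImageInspect`, `imageLister`, `podmanImageIDs`, and the example pullspec are hypothetical names chosen for illustration. The only detail taken from the thread is the `podman images -q --filter reference=<image>` invocation that shows up in the MCD logs.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// imageLister returns the raw output of a local image lookup for a reference.
// Hypothetical seam so the decision logic can be exercised without podman.
type imageLister func(ref string) (string, error)

// podmanImageIDs runs the lookup seen in the MCD logs:
//   podman images -q --filter reference=<ref>
// It prints one image ID per line, or nothing if the image is not local.
func podmanImageIDs(ref string) (string, error) {
	out, err := exec.Command("podman", "images", "-q", "--filter", "reference="+ref).Output()
	if err != nil {
		return "", fmt.Errorf("listing local images for %s: %w", ref, err)
	}
	return string(out), nil
}

// shouldSkipImageInspect reports whether the remote inspect can be skipped:
// only when the PinnedImages gate is enabled AND the image is already local.
func shouldSkipImageInspect(pinnedImagesEnabled bool, osImage string, list imageLister) (bool, error) {
	if !pinnedImagesEnabled {
		return false, nil
	}
	ids, err := list(osImage)
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(ids) != "", nil
}

func main() {
	// Hypothetical pullspec; substitute the osImage of the current config.
	ref := "registry.example.com/os/image:latest"
	skip, err := shouldSkipImageInspect(true, ref, podmanImageIDs)
	if err != nil {
		fmt.Println("could not check local images:", err)
		return
	}
	fmt.Println("skip imageInspect:", skip)
}
```

Passing the lister as a parameter is only a testing convenience in this sketch; the point is that both conditions must hold before the registry round-trip is skipped.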

openshift-ci-robot commented 2 months ago

@hexfusion: This pull request references Jira Issue OCPBUGS-35199, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
* bug is open, matching expected state (open)
* bug target version (4.17.0) matches configured target version for branch (4.17.0)
* bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4402):

>This PR is a follow-up to #3821 and skips the imageInspect check if `PinnedImages` feature is enabled and the osImage has been pulled locally.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hexfusion

Once this PR has been reviewed and has the lgtm label, please assign djoshy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
- **[OWNERS](https://github.com/openshift/machine-config-operator/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

openshift-ci-robot commented 2 months ago

@hexfusion: This pull request references Jira Issue OCPBUGS-35199, which is valid.

3 validation(s) were run on this bug
* bug is open, matching expected state (open)
* bug target version (4.17.0) matches configured target version for branch (4.17.0)
* bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

In response to [this](https://github.com/openshift/machine-config-operator/pull/4402):

>This PR is a follow-up to #4347 and #3821. This PR skips the imageInspect check if `PinnedImages` feature is enabled and the osImage has been pulled locally.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 2 months ago

@hexfusion: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-hypershift | 6ce6cd39fab11661c8200252d6fd4497d8d52975 | link | true | `/test e2e-hypershift` |
| ci/prow/e2e-gcp-op-techpreview | 6ce6cd39fab11661c8200252d6fd4497d8d52975 | link | false | `/test e2e-gcp-op-techpreview` |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 6ce6cd39fab11661c8200252d6fd4497d8d52975 | link | false | `/test e2e-azure-ovn-upgrade-out-of-change` |
| ci/prow/e2e-vsphere-ovn-upi-zones | 6ce6cd39fab11661c8200252d6fd4497d8d52975 | link | false | `/test e2e-vsphere-ovn-upi-zones` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
sergiordlr commented 2 months ago

We ran an upgrade in a disconnected cluster, with pinned images and an empty pull-secret. No access to any registry.

Upgrade from 4.17.0-0.nightly-2024-06-06-061523 to 4.16.0-0.ci.test-2024-06-12-092356-ci-ln-frc8mnk-latest (ci image with our fix)

We saw this error in the MCD logs:

2024-06-13T10:11:04.640197053+00:00 stderr F I0613 10:11:04.640183  151780 rpm-ostree.go:316] Running captured: podman images -q --filter reference=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307
2024-06-13T10:11:10.825691692+00:00 stderr F I0613 10:11:10.825643  151780 pinned_image_set.go:426] Completed scheduling 25% of images
2024-06-13T10:11:20.840856559+00:00 stderr F I0613 10:11:20.840805  151780 pinned_image_set.go:426] Completed scheduling 50% of images
2024-06-13T10:11:30.856404386+00:00 stderr F I0613 10:11:30.856359  151780 pinned_image_set.go:426] Completed scheduling 75% of images
2024-06-13T10:11:40.872787010+00:00 stderr F I0613 10:11:40.872741  151780 pinned_image_set.go:426] Completed scheduling 100% of images
2024-06-13T10:11:42.981134742+00:00 stderr F I0613 10:11:42.981101  151780 pinned_image_set.go:527] CRI-O config file is up to date, no reload required
2024-06-13T10:12:03.746391202+00:00 stderr F I0613 10:12:03.746352  151780 certificate_writer.go:288] Certificate was synced from controllerconfig resourceVersion 114001
2024-06-13T10:12:04.720906853+00:00 stderr F time="2024-06-13T10:12:04Z" level=warning msg="Failed, retrying in 1s ... (1/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 54.86.200.156:443: i/o timeout"
2024-06-13T10:13:05.741726483+00:00 stderr F time="2024-06-13T10:13:05Z" level=warning msg="Failed, retrying in 2s ... (2/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 54.221.103.142:443: i/o timeout"
2024-06-13T10:14:03.185800816+00:00 stderr F I0613 10:14:03.185750  151780 daemon.go:1364] Shutting down MachineConfigDaemon

The image that triggers the error is the original coreos image, not the target coreos image:

$ oc adm release info registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-06-061523 --pullspecs| grep coreos
  rhel-coreos                                    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307

After reporting this error for a long while, and even after MCD restarts, the configuration is eventually applied. I don't know why the MCD stops restarting and eventually decides to apply the configuration.

2024-06-13T10:19:44.284840446+00:00 stderr F I0613 10:19:44.284800  155375 pinned_image_set.go:527] CRI-O config file is up to date, no reload required
2024-06-13T10:20:05.045533819+00:00 stderr F I0613 10:20:05.045476  155375 certificate_writer.go:288] Certificate was synced from controllerconfig resourceVersion 114001
2024-06-13T10:20:06.014479760+00:00 stderr F time="2024-06-13T10:20:06Z" level=warning msg="Failed, retrying in 1s ... (1/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 44.194.103.74:443: i/o timeout"
2024-06-13T10:21:07.035257526+00:00 stderr F time="2024-06-13T10:21:07Z" level=warning msg="Failed, retrying in 2s ... (2/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 54.173.5.6:443: i/o timeout"
2024-06-13T10:21:55.224373042+00:00 stderr F I0613 10:21:55.224329  155375 pinned_image_set.go:302] Reconciling pinned image set: 99-worker-pinned-release: generation: 1
2024-06-13T10:21:55.328957322+00:00 stderr F I0613 10:21:55.328920  155375 pinned_image_set.go:527] CRI-O config file is up to date, no reload required
2024-06-13T10:22:09.059813690+00:00 stderr F W0613 10:22:09.059770  155375 daemon.go:2620] Unable to check manifest for matching hash: error parsing image name "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307": (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 54.86.200.156:443: i/o timeout
2024-06-13T10:22:09.059813690+00:00 stderr F I0613 10:22:09.059793  155375 rpm-ostree.go:316] Running captured: rpm-ostree kargs
2024-06-13T10:22:09.171178020+00:00 stderr F I0613 10:22:09.171136  155375 update.go:2621] Validated on-disk state
2024-06-13T10:22:09.222870849+00:00 stderr F I0613 10:22:09.222830  155375 update.go:2643] Adding SIGTERM protection
2024-06-13T10:22:09.242960693+00:00 stderr F I0613 10:22:09.242923  155375 update.go:1009] Checking Reconcilable for config rendered-worker-c11566a8572146defaf95ca346654742 to rendered-worker-64602b930cbae5db1feb30e67b974b39
2024-06-13T10:22:09.288117764+00:00 stderr F I0613 10:22:09.288074  155375 update.go:2621] Starting update from rendered-worker-c11566a8572146defaf95ca346654742 to rendered-worker-64602b930cbae5db1feb30e67b974b39: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
2024-06-13T10:22:09.322117482+00:00 stderr F I0613 10:22:09.322082  155375 update.go:757] Calculating node disruption actions
2024-06-13T10:22:09.322117482+00:00 stderr F I0613 10:22:09.322111  155375 drain.go:121] Checking drain required for node disruption actions

Eventually we get an upgrade that is reported to be successful

$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci.test-2024-06-12-092356-ci-ln-frc8mnk-latest   True        False         90m     Cluster version is 4.16.0-0.ci.test-2024-06-12-092356-ci-ln-frc8mnk-latest

Nevertheless, the `oc debug` command does not work, because it tries to use the old original image instead of the one corresponding to the new version in the cluster. The tools imagestream in the openshift namespace shows that the new image could not be imported:

oc get is -n openshift tools -oyaml
....
  tags:
  - conditions:
    - generation: 8
      lastTransitionTime: "2024-06-13T11:01:04Z"
      message: 'Internal error occurred: [you may not have access to the container
        image "ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocpupg/release@sha256:4ff6d13185bb1e378d8527ad6d3a3a92e11024488c1a74bffc42b0c9f8f21fd7",
        registry.build03.ci.openshift.org/ci-ln-frc8mnk/stable@sha256:4ff6d13185bb1e378d8527ad6d3a3a92e11024488c1a74bffc42b0c9f8f21fd7:
        Get "https://registry.build03.ci.openshift.org/v2/": dial tcp 54.172.72.33:443:
        i/o timeout]'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items:

We can observe similar failures in these imagestreams in the openshift namespace

    name: cli
    name: cli-artifacts
    name: driver-toolkit
    name: installer
    name: installer-artifacts
    name: must-gather
    name: network-tools
    name: oauth-proxy
    name: tests
    name: tools
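
To enumerate the affected imagestreams without reading each one by hand, something like the following sketch could be used. It assumes the JSON form of the listing (for example the output of `oc get is -n openshift -o json`) and that each entry under `status.tags` carries a `tag` name next to the `conditions` shown in the YAML above. The function name `failedImports`, the sample data, and the tag name `latest` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Only the fields needed for this check are modelled.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

type imageStream struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Status struct {
		Tags []struct {
			Tag        string      `json:"tag"`
			Conditions []condition `json:"conditions"`
		} `json:"tags"`
	} `json:"status"`
}

type imageStreamList struct {
	Items []imageStream `json:"items"`
}

// failedImports returns "<imagestream>:<tag>" for every tag that has an
// ImportSuccess condition with status "False".
func failedImports(data []byte) ([]string, error) {
	var list imageStreamList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, fmt.Errorf("decoding imagestream list: %w", err)
	}
	var failed []string
	for _, is := range list.Items {
		for _, t := range is.Status.Tags {
			for _, c := range t.Conditions {
				if c.Type == "ImportSuccess" && c.Status == "False" {
					failed = append(failed, is.Metadata.Name+":"+t.Tag)
				}
			}
		}
	}
	return failed, nil
}

func main() {
	// Hypothetical sample; in practice this would be the JSON listing
	// of the imagestreams in the openshift namespace.
	sample := []byte(`{"items":[
	  {"metadata":{"name":"tools"},"status":{"tags":[
	    {"tag":"latest","conditions":[{"type":"ImportSuccess","status":"False"}]}]}},
	  {"metadata":{"name":"healthy"},"status":{"tags":[{"tag":"latest"}]}}
	]}`)
	failed, err := failedImports(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, f := range failed {
		fmt.Println(f)
	}
}
```

Tags without any conditions are treated as healthy here, which matches the YAML above only in the sense that a failed import is the case that records an `ImportSuccess` condition with status "False".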

Extensions are working fine after the upgrade

$ oc debug -q --image registry.build03.ci.openshift.org/ci-ln-frc8mnk/stable@sha256:4ff6d13185bb1e378d8527ad6d3a3a92e11024488c1a74bffc42b0c9f8f21fd7 node/ip-10-0-52-40 -- chroot /host rpm -q usbguard
usbguard-1.0.0-15.el9.x86_64

We are holding the PR until we decide whether the "reading manifest" error needs to be fixed before merging it.

/hold