Closed cheesesashimi closed 2 months ago
@cheesesashimi: This pull request references Jira Issue OCPBUGS-34251, which is valid.
Requesting review from QA contact: /cc @sunzhaohua2
The bug has been updated to refer to the pull request using the external bug tracker.
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
@cheesesashimi: This pull request references Jira Issue OCPBUGS-34261, which is invalid:
Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@sinnykumari: This pull request references Jira Issue OCPBUGS-34261, which is valid. The bug has been moved to the POST state.
Requesting review from QA contact: /cc @sergiordlr
I don't have full insight, but the overall approach looks sane.
If I understand correctly, this also includes the fix from https://github.com/openshift/machine-config-operator/pull/4372?
@djoshy would be good to get your thoughts on this PR.
@sinnykumari That is correct. I used that PR as the base for this one because of the overlap.
Overall looks good to me, the comments and tests were super helpful. On the topic of starting informers via feature gates, we have an example in our repo here. But I think your approach is fine as well. I think @yuqi-zhang wanted to take a look as well, but from the internal registry secrets fetch/merging standpoint, this is LGTM.
/retest
/retest-required
/test e2e-gcp-op-techpreview
/retest-required
@cheesesashimi: This pull request references Jira Issue OCPBUGS-33913, which is invalid:
Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
This pull request references Jira Issue OCPBUGS-34261, which is valid.
Requesting review from QA contact: /cc @sergiordlr
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@yuqi-zhang: This pull request references Jira Issue OCPBUGS-33913, which is valid. The bug has been moved to the POST state.
Requesting review from QA contact: /cc @sergiordlr
This pull request references Jira Issue OCPBUGS-34261, which is valid.
Requesting review from QA contact: /cc @sergiordlr
/test e2e-gcp-op-techpreview
/retest-required
Verification of the OCL functionality. Verified using IPI on GCP
We verified the use of the buildOutputs secret by running this test case:
OCP-73947 - [MCO][MCO-665][OCPBUGS-34261] OCB use OutputImage pull secret
We were able to apply images in imagestreams stored in different namespaces than MCO
We can use imagestreams stored in other namespaces
+ oc debug node/sregidor-voclsec4-nmrcd-worker-a-7gzvp -q -- chroot /host rpm-ostree status
State: idle
Deployments:
* ostree-unverified-registry:image-registry.openshift-image-registry.svc:5000/mco-tc-73947/ocb-image@sha256:49df96c5935886df93fcaf9434d219784c394e1adf289befa857ce6a9006cbe8
Digest: sha256:49df96c5935886df93fcaf9434d219784c394e1adf289befa857ce6a9006cbe8
Version: 416.94.202406042100-0 (2024-06-12T11:34:43Z)
We ran the following test cases as well:
OCP-73599 - [MCO][MCO-665] OCB Validate MachineOSConfig. New 4.16 OCB API
OCP-73496 - [MCO][MCO-665] OCB use custom Containerfile. New 4.16 OCB API.
OCP-73494 - [MCO][MCO-665] OCB Wiring up Productionalized Build Controller. New 4.16 OCB API
OCP-73436 - [MCO][MCO-1100] OCB Use custom Containerfile with rhel enablement
OCP-74111 - [MCO][OCPBUGS-34079] OCB Canonicalized secrets are updated when the original secrets are updated
ISSUES FOUND:
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  creationTimestamp: "2024-06-12T11:01:57Z"
  generation: 1
  name: tc-73599-infra
  resourceVersion: "71863"
  uid: 66bd21aa-d9e8-4f21-96ac-1e57d481f0fb
spec:
  # buildOutputs:
  #   currentImagePullSecret:
  #     name: "" # If we use an empty value we get the same error
  buildInputs:
    baseImagePullSecret:
      name: fake-pull-secret
    containerFile: []
    imageBuilder:
      imageBuilderType: PodImageBuilder
    renderedImagePushSecret:
      # name: clonned-pull-secret-qhlf8wk3
      name: pull-copy
    renderedImagePushspec: image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-infra-image:latest
  machineConfigPool:
    name: infra
The machine-config CO is degraded, and the error message does not describe the real problem.
oc get co machine-config
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
machine-config 4.16.0-0.ci.test-2024-06-12-085144-ci-ln-0lf8xsb-latest True False True 99m Failed to resync 4.16.0-0.ci.test-2024-06-12-085144-ci-ln-0lf8xsb-latest because: DaemonSet.apps "machine-config-daemon" is invalid: [spec.template.spec.volumes[4].secret.secretName: Required value, spec.template.spec.volumes[4].name: Required value, spec.template.spec.containers[0].volumeMounts[1].name: Required value, spec.template.spec.containers[0].volumeMounts[1].name: Not found: ""]
$ oc get events |grep "%\!s"
133m Normal SetDesiredConfig machineconfigpool/master Targeted node sregidor-voclsec4-nmrcd-master-0 to %!s(*string=0xc000e28148)
126m Normal SetDesiredConfig machineconfigpool/master Targeted node sregidor-voclsec4-nmrcd-master-2 to %!s(*string=0xc001129748)
121m Normal SetDesiredConfig machineconfigpool/master Targeted node sregidor-voclsec4-nmrcd-master-1 to %!s(*string=0xc002087a08)
133m Normal SetDesiredConfig machineconfigpool/worker Targeted node sregidor-voclsec4-nmrcd-worker-a-7gzvp to %!s(*string=0xc000bdb748)
128m Normal SetDesiredConfig machineconfigpool/worker Targeted node sregidor-voclsec4-nmrcd-worker-b-mzwfr to %!s(*string=0xc0015e2408)
125m Normal SetDesiredConfig machineconfigpool/worker Targeted node sregidor-voclsec4-nmrcd-worker-c-x65tq to %!s(*string=0xc001625748)
49m Normal SetDesiredConfigAndOSImage machineconfigpool/worker Targeted node sregidor-voclsec4-nmrcd-worker-a-7gzvp to %!s(*string=0xc0012f4988)
45m Normal SetDesiredConfigAndOSImage machineconfigpool/worker Targeted node sregidor-voclsec4-nmrcd-worker-b-mzwfr to %!s(*string=0xc001d306c8)
11m Normal SetDesiredConfigAndOSImage machineconfigpool/worker Targeted node sregidor-voclsec4-nmrcd-worker-a-7gzvp to %!s(*string=0xc001852148)
7m59s Normal SetDesiredConfigAndOSImage machineconfigpool/worker Targeted node sregidor-voclsec4-nmrcd-worker-b-mzwfr to %!s(*string=0xc001a6a408)
When 2 machineosbuilds are created and we remove the MOSC, not all machineosbuilds are garbage-collected.
MCPs are stuck doing nothing after the OCL image is built. The delay is more or less random; it can take up to 27 minutes to start applying the new image.
I0612 13:19:33.666572 1 node_controller.go:1008] Requeueing layered pool worker: Desired Image not set in MachineOSBuild
I0612 13:19:38.722049 1 node_controller.go:1008] Requeueing layered pool worker: Desired Image not set in MachineOSBuild
I0612 13:33:40.247919 1 node_controller.go:566] Pool worker: 2 candidate nodes in 1 zones for update, capacity: 1
I0612 13:33:40.648685 1 node_controller.go:1328] Continuing to sync layered MachineConfigPool worker
The only issue that seems to be 100% related to this PR is the one reporting a bad message when the buildOutputs secret is missing or misconfigured. Nevertheless, could you please confirm that the other issues are not related to this PR?
Verification of the internal-registry-pull-secret file
Passed:
"[sig-mco] MCO Layering Author:sregidor-ConnectedOnly-Longduration-NonPreRelease-Medium-54056-Update osImage using the internal registry to store the image [Disruptive] [Serial]"
The file is not present in the rendered machine-configs
$ oc get mc -oyaml rendered-worker-de597a802ef5cd27f6c53083f93a89c3 | grep "path:" | grep internal
$ oc get mc -oyaml rendered-master-13a425be563ebc405e0550d40f658ee3 | grep "path:" | grep internal
The file exists on the nodes and has the same content as the one stored in .spec.internalRegistryPullSecret in the controllerconfig resource.
No issues found.
Answers inline:
When we create a MOSC without the buildOutputs secret, the machine-config CO is degraded with the wrong message
I feel like this shouldn't be possible given API validation, though evidently, it is possible. To prevent this, we can check for the presence of all of the secrets referenced by the MachineOSConfig and emit an error if an expected secret is not present. I feel like such a thing would also be covered by API validation, but I don't know for sure.
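The presence check described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the MCO implementation: `secretRefs`, `validateSecrets`, and the `exists` callback are hypothetical names, and the struct is a simplified stand-in for the secret references in a MachineOSConfig rather than the real API type.

```go
package main

import (
	"errors"
	"fmt"
)

// secretRefs is a hypothetical, simplified stand-in for the secret names
// referenced by a MachineOSConfig. It is not the real API type.
type secretRefs struct {
	BaseImagePullSecret     string
	RenderedImagePushSecret string
	CurrentImagePullSecret  string
}

// validateSecrets returns an error naming every reference that is empty or
// that does not resolve to an existing secret. The exists callback abstracts
// the actual lookup (for example, a lister call) and is an assumption here.
func validateSecrets(refs secretRefs, exists func(name string) bool) error {
	fields := []struct{ field, name string }{
		{"buildInputs.baseImagePullSecret", refs.BaseImagePullSecret},
		{"buildInputs.renderedImagePushSecret", refs.RenderedImagePushSecret},
		{"buildOutputs.currentImagePullSecret", refs.CurrentImagePullSecret},
	}
	var errs []error
	for _, f := range fields {
		switch {
		case f.name == "":
			errs = append(errs, fmt.Errorf("%s: no secret name set", f.field))
		case !exists(f.name):
			errs = append(errs, fmt.Errorf("%s: secret %q not found", f.field, f.name))
		}
	}
	// errors.Join returns nil when there are no errors to report.
	return errors.Join(errs...)
}

func main() {
	known := map[string]bool{"fake-pull-secret": true, "pull-copy": true}
	refs := secretRefs{
		BaseImagePullSecret:     "fake-pull-secret",
		RenderedImagePushSecret: "pull-copy",
		// CurrentImagePullSecret left empty, as in the MachineOSConfig above.
	}
	if err := validateSecrets(refs, func(n string) bool { return known[n] }); err != nil {
		fmt.Println(err)
	}
}
```

Reporting the offending field by name would let the operator surface the actual problem instead of the downstream DaemonSet validation error.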
Events seem to show a wrong message, bad format
That feels like a regression though I don't think it was caused by this PR. I will take a look though since the fix should be pretty simple.
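For reference, `%!s(*string=0x...)` is what Go's fmt package prints when a `*string` is passed to the `%s` verb. The sketch below reproduces that output and shows one way to avoid it. The function names and message text are illustrative only and are not taken from the MCO code.

```go
package main

import "fmt"

// formatTargetBuggy mimics the malformed event text: a *string is handed to
// the %s verb, so fmt prints the pointer rather than the string it points to.
// The argument is routed through an `any` variable only so that `go vet`
// does not reject this deliberately wrong call.
func formatTargetBuggy(node string, desired *string) string {
	var arg any = desired
	return fmt.Sprintf("Targeted node %s to %s", node, arg)
}

// formatTargetFixed dereferences the pointer (guarding against nil) before
// formatting, so the event carries the actual value.
func formatTargetFixed(node string, desired *string) string {
	value := "<unset>"
	if desired != nil {
		value = *desired
	}
	return fmt.Sprintf("Targeted node %s to %s", node, value)
}

func main() {
	desired := "rendered-worker-269adf05864ad9d63d7954781af64f0b"
	fmt.Println(formatTargetBuggy("worker-a", &desired))
	fmt.Println(formatTargetFixed("worker-a", &desired))
}
```

The first line printed ends in `%!s(*string=0x...)` with a pointer address that varies per run; the second ends in the rendered config name.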
When 2 machineosbuilds are created and we remove the MOSC, not all machineosbuilds are garbage-collected
I don't think we have a good story around garbage collection right now. That said, I suspect the root cause here is that the machine-os-builder pod might be shutting down before the second deletion can take place. I don't think that was introduced in this PR, however.
MCPs are stuck doing nothing after the OCL image is built. The delay is more or less random; it can take up to 27 minutes to start applying the new image.
Question: When the MCP is stuck in this state, what are the MCDs doing? Specifically, what I'm asking is: Is the DaemonSet rolling out a new MCD? Or are they all up and running? Additionally, do they all have the secret volume mounts for the secrets present in the MachineOSConfigs? This will help clarify whether this PR is what is causing this or if the root cause lies elsewhere.
@cheesesashimi I have taken a must-gather file while the cluster is stuck after building the image. You can find the must-gather file here: https://issues.redhat.com/browse/OCPBUGS-34261?focusedId=24922586&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24922586
For posterity, I looked at the must-gather provided by @sergiordlr and also looked at the cluster he was running everything on. I observed a few interesting things, but cannot definitively determine whether this PR introduces this rollout delay. Here's why:
Theoretically, the MCD pod could fail to start if the secret it is configured to use does not actually exist (unless one sets optional: true on the volume config, which we are not doing). Indeed, I saw that specific scenario a few times while I was trying to write the e2e test for this. In that scenario, replacement MCD pods would be blocked from rolling out given the default rollout strategy we're using for them. Similarly, if one of the MCD pods is configured to block SIGTERMs (as it does whenever an update is in progress), that could block the rollout of replacement MCD pods, since the DaemonSet controller will not be able to terminate that pod via a SIGTERM.
The following solutions exist for both of those scenarios:
Update: I was able to reproduce the MCD health / liveness check failures in my sandbox cluster. While this occurred, I streamed the pod logs using k9s and observed that the checks were temporarily failing because we start the health endpoint fairly late in the MCD startup process. At this point, I can safely rule that out as the root cause of the delayed image rollout once the image is built.
Verified using IPI on GCP
Now the secrets are reporting a descriptive error when they are not properly configured
Now the events are displaying a properly formatted message
$ oc get events |grep SetDesiredConfig
75m Normal SetDesiredConfigAndOSImage machineconfigpool/worker Targeted node sregidor-voclsec6-knd2c-worker-a-m2krj to MachineConfig: rendered-worker-269adf05864ad9d63d7954781af64f0b / Image: image-registry.openshift-image-registry.svc:5000/mco-tc-73947/ocb-image@sha256:1349bbff7e3e4a2e3314ae98af7b96a968c0aa782adee746998bf828e9204f13
$ oc get events |grep SetDesiredConfigAndOSImage
75m Normal SetDesiredConfigAndOSImage machineconfigpool/worker Targeted node sregidor-voclsec6-knd2c-worker-a-m2krj to MachineConfig: rendered-worker-269adf05864ad9d63d7954781af64f0b / Image: image-registry.openshift-image-registry.svc:5000/mco-tc-73947/ocb-image@sha256:1349bbff7e3e4a2e3314ae98af7b96a968c0aa782adee746998bf828e9204f13
Regarding the time gap between when MCO builds the image and when it starts applying it, we have created a new Jira ticket to track this issue: https://issues.redhat.com/browse/OCPBUGS-35509
Regarding the machineosbuild resources that are not garbage-collected, we have created a new Jira ticket to track this behaviour: https://issues.redhat.com/browse/OCPBUGS-35512
We have verified that, with this fix, OCL functionality can properly use the buildOutputs.currentImagePullSecret secret. Hence, we add the qe-approved label. The other issues mentioned in this PR will be tracked and fixed through the Jira tickets created for them.
/label qe-approved
@cheesesashimi: This pull request references Jira Issue OCPBUGS-33913, which is valid.
Requesting review from QA contact: /cc @sergiordlr
This pull request references Jira Issue OCPBUGS-34261, which is valid.
Requesting review from QA contact: /cc @sergiordlr
Regarding the update latency bug (OCPBUGS-35509): This PR does not introduce that problem. Instead, it is caused by the lack of informers within NodeController for both the MachineOSConfig and MachineOSBuild objects. See bug for further discussion.
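To illustrate why missing informers can delay a rollout: without an event handler, the pool is only reconsidered when something unrelated requeues it, whereas an informer-style handler queues the pool as soon as the build object changes. The following is a toy, stdlib-only sketch of that idea. `buildEvent`, `poolQueue`, and `onBuildUpdate` are hypothetical names and do not reflect the NodeController code or the client-go API.

```go
package main

import "fmt"

// buildEvent is a hypothetical notification that a MachineOSBuild changed.
type buildEvent struct {
	Pool  string // name of the MachineConfigPool the build belongs to
	Image string // final image pullspec; empty until the build finishes
}

// poolQueue is a toy stand-in for a controller work queue of pool names.
// Like a real work queue, it ignores a key that is already pending.
type poolQueue struct {
	pending []string
	seen    map[string]bool
}

func (q *poolQueue) add(pool string) {
	if q.seen == nil {
		q.seen = map[string]bool{}
	}
	if !q.seen[pool] {
		q.seen[pool] = true
		q.pending = append(q.pending, pool)
	}
}

// onBuildUpdate plays the role of an informer update handler: once the build
// reports an image, the owning pool is queued for a sync right away instead
// of waiting for an unrelated event or periodic resync.
func onBuildUpdate(q *poolQueue, ev buildEvent) {
	if ev.Image == "" {
		return // build still running, nothing to roll out yet
	}
	q.add(ev.Pool)
}

func main() {
	q := &poolQueue{}
	onBuildUpdate(q, buildEvent{Pool: "worker"}) // build in progress
	onBuildUpdate(q, buildEvent{Pool: "worker", Image: "registry.example/ocb-image@sha256:abc"})
	fmt.Println(q.pending)
}
```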
/test test-unit
@cheesesashimi: The specified target(s) for /test were not found.
/test unit
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cheesesashimi, yuqi-zhang
/unhold
/retest-required
/test e2e-vsphere-ovn-upi-zones
/test e2e-azure-ovn-upgrade-out-of-change
/retest-required
Remaining retests: 0 against base HEAD 2e1b6bc163c3cfb6a0261b26d20e9897823ba156 and 2 for PR HEAD 521d36a9145296ef46c1419382871816b8d29301 in total
@cheesesashimi: Jira Issue OCPBUGS-33913: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-33913 has been moved to the MODIFIED state.
Jira Issue OCPBUGS-34261: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-34261 has been moved to the MODIFIED state.
[ART PR BUILD NOTIFIER]
This PR has been included in build ose-machine-config-operator-container-v4.17.0-202406250241.p0.g58aca69.assembly.stream.el9 for distgit ose-machine-config-operator. All builds following this will include this PR.
/cherry-pick release-4.16
@cheesesashimi: #4395 failed to apply on top of branch "release-4.16":
Applying: certificatewriter should handle the internal registry pull secret
Using index info to reconstruct a base tree...
M pkg/controller/template/template_controller.go
M pkg/daemon/update.go
M pkg/operator/sync.go
M test/e2e-techpreview/helpers_test.go
M test/helpers/utils.go
Falling back to patching base and 3-way merge...
Auto-merging test/helpers/utils.go
CONFLICT (content): Merge conflict in test/helpers/utils.go
Auto-merging test/e2e-techpreview/helpers_test.go
Auto-merging pkg/operator/sync.go
Auto-merging pkg/daemon/update.go
Auto-merging pkg/controller/template/template_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 certificatewriter should handle the internal registry pull secret
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
- What I did
The CustomImagePullSecret that is referred to by a MachineOSConfig was not being used. This means that pulling an OS image from a private image registry will not work. This work does the following:
- How to verify it
- Description for the changelog
CurrentImagePullSecret should be used by MachineConfigDaemon