OCPBUGS-33913, OCPBUGS-34261: CurrentImagePullSecret should be consumed by the MCD

cheesesashimi commented 3 months ago

- What I did

The CustomImagePullSecret that is referred to by a MachineOSConfig was not being used. This means that pulling an OS image from a private image registry will not work. This work does the following:

Migrates functions for changing the formatting of pull secrets from the BuildController package into controller/common, allowing reuse.
Adds secret volume mounts to the MachineConfigDaemon DaemonSet as well as template rendering logic to collect that information whenever a MachineOSConfig is changed.
Adds an MCD-side merge process that reads the secrets from the volume mounts, reads the secrets provided by the ControllerConfig, and then merges them into /etc/mco/internal-image-pull-secret.json.
Adds unit and e2e tests to validate that a secret referenced by a MachineOSConfig is written to the nodes' filesystem, overriding any values from the ControllerConfig.

- How to verify it

Bring up a cluster with the TechPreviewNoUpgrade feature-gate added.
Set up a MachineOSConfig to use a private image registry for both pushing and pulling.
Create a MachineConfigPool and a MachineOSConfig and wait for the build to complete. While this is occurring the MCD DaemonSet will restart since it will now have a secret mounted into it. If multiple MachineOSConfigs are present, multiple secret volumes will be mounted into the MCD pod, even if they are the same secret.
Move a node into the new MachineConfigPool and wait for it to drain and reboot.
When the node comes back up, it should be booted into the OS image that was pulled from the private image registry.

- Description for the changelog CurrentImagePullSecret should be used by MachineConfigDaemon

openshift-ci-robot commented 3 months ago

@cheesesashimi: This pull request references Jira Issue OCPBUGS-34251, which is valid.

3 validation(s) were run on this bug

* bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sunzhaohua2

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4395): >**- What I did** > >The CustomImagePullSecret that is referred to by a MachineOSConfig was not being used. This means that pulling an OS image from a private image registry will not work. This work does the following: > >1. Migrates functions for changing the formatting of pull secrets from the BuildController package into controller/common, allowing reuse. >2. Adds secret volume mounts to the MachineConfigDaemon DaemonSet as well as template rendering logic to collect that information whenever a MachineOSConfig is changed. >3. Adds an MCD-side merge process that reads the secrets from the volume mounts, reads the secrets provided by the ControllerConfig, and then merges them into /etc/mco/internal-image-pull-secret.json. >4. Adds unit and e2e tests to validate that a secret referenced by a MachineOSConfig is written to the nodes' filesystem, overriding any values from the ControllerConfig. > >**- How to verify it** > >1. Bring up a cluster with the TechPreviewNoUpgrade feature-gate added. >2. Set up a MachineOSConfig to use a private image registry for both pushing and pulling. >3. Create a MachineConfigPool and a MachineOSConfig and wait for the build to complete. While this is occurring the MCD DaemonSet will restart since it will now have a secret mounted into it. If multiple MachineOSConfigs are present, multiple secret volumes will be mounted into the MCD pod, even if they are the same secret. >4. Move a node into the new MachineConfigPool and wait for it to drain and reboot. >5. When the node comes back up, it should be booted into the OS image that was pulled from the private image registry. > >**- Description for the changelog** >CurrentImagePullSecret should be used by MachineConfigDaemon > > Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 3 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

openshift-ci-robot commented 3 months ago

@cheesesashimi: This pull request references Jira Issue OCPBUGS-34261, which is invalid:

expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4395): >**- What I did** > >The CustomImagePullSecret that is referred to by a MachineOSConfig was not being used. This means that pulling an OS image from a private image registry will not work. This work does the following: > >1. Migrates functions for changing the formatting of pull secrets from the BuildController package into controller/common, allowing reuse. >2. Adds secret volume mounts to the MachineConfigDaemon DaemonSet as well as template rendering logic to collect that information whenever a MachineOSConfig is changed. >3. Adds an MCD-side merge process that reads the secrets from the volume mounts, reads the secrets provided by the ControllerConfig, and then merges them into /etc/mco/internal-image-pull-secret.json. >4. Adds unit and e2e tests to validate that a secret referenced by a MachineOSConfig is written to the nodes' filesystem, overriding any values from the ControllerConfig. > >**- How to verify it** > >1. Bring up a cluster with the TechPreviewNoUpgrade feature-gate added. >2. Set up a MachineOSConfig to use a private image registry for both pushing and pulling. >3. Create a MachineConfigPool and a MachineOSConfig and wait for the build to complete. While this is occurring the MCD DaemonSet will restart since it will now have a secret mounted into it. If multiple MachineOSConfigs are present, multiple secret volumes will be mounted into the MCD pod, even if they are the same secret. >4. Move a node into the new MachineConfigPool and wait for it to drain and reboot. >5. When the node comes back up, it should be booted into the OS image that was pulled from the private image registry. > >**- Description for the changelog** >CurrentImagePullSecret should be used by MachineConfigDaemon > > Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

sinnykumari commented 2 months ago

/jira refresh

openshift-ci-robot commented 2 months ago

@sinnykumari: This pull request references Jira Issue OCPBUGS-34261, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

* bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

In response to [this](https://github.com/openshift/machine-config-operator/pull/4395#issuecomment-2157773142): >/jira refresh Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

sinnykumari commented 2 months ago

I don't have full insight but overall approach looks sane.
If i understand correctly, this also includes fix from https://github.com/openshift/machine-config-operator/pull/4372 ? @djoshy would be good to get your thoughts on this PR.

cheesesashimi commented 2 months ago

@sinnykumari That is correct. I used that PR as the base for this one because of the overlap.

djoshy commented 2 months ago

Overall looks good to me, the comments and tests were super helpful. On the topic of starting informers via feature gates, we have an example in our repo here. But I think your approach is fine as well. I think @yuqi-zhang wanted to take a look as well, but from the internal registry secrets fetch/merging standpoint, this is LGTM.

sergiordlr commented 2 months ago

/retest

cheesesashimi commented 2 months ago

/retest-required

cheesesashimi commented 2 months ago

/test e2e-gcp-op-techpreview

inesqyx commented 2 months ago

/retest-required

openshift-ci-robot commented 2 months ago

@cheesesashimi: This pull request references Jira Issue OCPBUGS-33913, which is invalid:

expected the bug to target only the "4.17.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-34261, which is valid.

3 validation(s) were run on this bug

* bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4395): >**- What I did** > >The CustomImagePullSecret that is referred to by a MachineOSConfig was not being used. This means that pulling an OS image from a private image registry will not work. This work does the following: > >1. Migrates functions for changing the formatting of pull secrets from the BuildController package into controller/common, allowing reuse. >2. Adds secret volume mounts to the MachineConfigDaemon DaemonSet as well as template rendering logic to collect that information whenever a MachineOSConfig is changed. >3. Adds an MCD-side merge process that reads the secrets from the volume mounts, reads the secrets provided by the ControllerConfig, and then merges them into /etc/mco/internal-image-pull-secret.json. >4. Adds unit and e2e tests to validate that a secret referenced by a MachineOSConfig is written to the nodes' filesystem, overriding any values from the ControllerConfig. > >**- How to verify it** > >1. Bring up a cluster with the TechPreviewNoUpgrade feature-gate added. >2. Set up a MachineOSConfig to use a private image registry for both pushing and pulling. >3. Create a MachineConfigPool and a MachineOSConfig and wait for the build to complete. While this is occurring the MCD DaemonSet will restart since it will now have a secret mounted into it. If multiple MachineOSConfigs are present, multiple secret volumes will be mounted into the MCD pod, even if they are the same secret. >4. Move a node into the new MachineConfigPool and wait for it to drain and reboot. >5. When the node comes back up, it should be booted into the OS image that was pulled from the private image registry. > >**- Description for the changelog** >CurrentImagePullSecret should be used by MachineConfigDaemon > > Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

yuqi-zhang commented 2 months ago

/jira refresh

openshift-ci-robot commented 2 months ago

@yuqi-zhang: This pull request references Jira Issue OCPBUGS-33913, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

* bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

This pull request references Jira Issue OCPBUGS-34261, which is valid.

3 validation(s) were run on this bug

* bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

In response to [this](https://github.com/openshift/machine-config-operator/pull/4395#issuecomment-2161792482): >/jira refresh Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

cheesesashimi commented 2 months ago

/test e2e-gcp-op-techpreview

inesqyx commented 2 months ago

/retest-required

sergiordlr commented 2 months ago

Verification of the OCL functionality. Verified using IPI on GCP

We verified the use of the buildOutputs secret running test case:

OCP-73947 - [MCO][MCO-665][OCPBUGS-34261] OCB use OutputImage pull secret

We were able to apply images in imagestreams stored in different namespaces than MCO

We can use imagestreams stored in other namespaces

+ oc debug node/sregidor-voclsec4-nmrcd-worker-a-7gzvp -q -- chroot /host rpm-ostree status
State: idle
Deployments:
* ostree-unverified-registry:image-registry.openshift-image-registry.svc:5000/mco-tc-73947/ocb-image@sha256:49df96c5935886df93fcaf9434d219784c394e1adf289befa857ce6a9006cbe8
                   Digest: sha256:49df96c5935886df93fcaf9434d219784c394e1adf289befa857ce6a9006cbe8
                  Version: 416.94.202406042100-0 (2024-06-12T11:34:43Z)

We run the following test cases as well:

OCP-73599 - [MCO][MCO-665] OCB Validate MachineOSConfig. New 41.6 OCB API
OCP-73496 - [MCO][MCO-665] OCB use custom Containerfile. New 4.16 OCB API.
OCP-73494 - [MCO][MCO-665] OCB Wiring up Productionalized Build Controller. New 4.16 OCB API
OCP-73436 - [MCO][MCO-1100] OCB Use custom Containerfile with rhel enablement
OCP-74111 - [MCO][OCPBUGS-34079] OCB Canonicalized secrets are updated when the original secrets are updated

ISSUES FOUND:

When we create a MOSC without buildOutputs secret mchine-config CO is degrade with the wrong message:

  apiVersion: machineconfiguration.openshift.io/v1alpha1
  kind: MachineOSConfig
  metadata:
    creationTimestamp: "2024-06-12T11:01:57Z"
    generation: 1
    name: tc-73599-infra
    resourceVersion: "71863"
    uid: 66bd21aa-d9e8-4f21-96ac-1e57d481f0fb
  spec:
 #   buildOutputs:
 #     currentImagePullSecret:
 #       name: ""   # If we use an empty value we get the same error
    buildInputs:
      baseImagePullSecret:
        name: fake-pull-secret
      containerFile: []
      imageBuilder:
        imageBuilderType: PodImageBuilder
      renderedImagePushSecret:
#        name: clonned-pull-secret-qhlf8wk3
        name: pull-copy
      renderedImagePushspec: image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-infra-image:latest
    machineConfigPool:
      name: infra

The machine-config CO is degraded, and the error message is not actually describing the real problem.

oc get co machine-config
NAME             VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.16.0-0.ci.test-2024-06-12-085144-ci-ln-0lf8xsb-latest   True        False         True       99m     Failed to resync 4.16.0-0.ci.test-2024-06-12-085144-ci-ln-0lf8xsb-latest because: DaemonSet.apps "machine-config-daemon" is invalid: [spec.template.spec.volumes[4].secret.secretName: Required value, spec.template.spec.volumes[4].name: Required value, spec.template.spec.containers[0].volumeMounts[1].name: Required value, spec.template.spec.containers[0].volumeMounts[1].name: Not found: ""]

Events seem to show a wrong message, bad format

$ oc get events |grep "%\!s"
133m        Normal    SetDesiredConfig                                  machineconfigpool/master                                          Targeted node sregidor-voclsec4-nmrcd-master-0 to %!s(*string=0xc000e28148)
126m        Normal    SetDesiredConfig                                  machineconfigpool/master                                          Targeted node sregidor-voclsec4-nmrcd-master-2 to %!s(*string=0xc001129748)
121m        Normal    SetDesiredConfig                                  machineconfigpool/master                                          Targeted node sregidor-voclsec4-nmrcd-master-1 to %!s(*string=0xc002087a08)
133m        Normal    SetDesiredConfig                                  machineconfigpool/worker                                          Targeted node sregidor-voclsec4-nmrcd-worker-a-7gzvp to %!s(*string=0xc000bdb748)
128m        Normal    SetDesiredConfig                                  machineconfigpool/worker                                          Targeted node sregidor-voclsec4-nmrcd-worker-b-mzwfr to %!s(*string=0xc0015e2408)
125m        Normal    SetDesiredConfig                                  machineconfigpool/worker                                          Targeted node sregidor-voclsec4-nmrcd-worker-c-x65tq to %!s(*string=0xc001625748)
49m         Normal    SetDesiredConfigAndOSImage                        machineconfigpool/worker                                          Targeted node sregidor-voclsec4-nmrcd-worker-a-7gzvp to %!s(*string=0xc0012f4988)
45m         Normal    SetDesiredConfigAndOSImage                        machineconfigpool/worker                                          Targeted node sregidor-voclsec4-nmrcd-worker-b-mzwfr to %!s(*string=0xc001d306c8)
11m         Normal    SetDesiredConfigAndOSImage                        machineconfigpool/worker                                          Targeted node sregidor-voclsec4-nmrcd-worker-a-7gzvp to %!s(*string=0xc001852148)
7m59s       Normal    SetDesiredConfigAndOSImage                        machineconfigpool/worker                                          Targeted node sregidor-voclsec4-nmrcd-worker-b-mzwfr to %!s(*string=0xc001a6a408)

When 2 machineosbuilds are created, and we remove the MSOC, not all machineosbuilds are garbage-collected
MCP are stuck doing nothing after the OCL image is built. The time is more or less random, it can take up to 27 minutes to start applying the new image.

I0612 13:19:33.666572 1 node_controller.go:1008] Requeueing layered pool worker: Desired Image not set in MachineOSBuild I0612 13:19:38.722049 1 node_controller.go:1008] Requeueing layered pool worker: Desired Image not set in MachineOSBuild I0612 13:33:40.247919 1 node_controller.go:566] Pool worker: 2 candidate nodes in 1 zones for update, capacity: 1 I0612 13:33:40.648685 1 node_controller.go:1328] Continuing to sync layered MachineConfigPool worker



The only issue that seems to be 100% related to this PR is the one reporting a bad message when the buildOutpus secret is missing or misconfigured. Nevertheless, could you please confirm that the issues are not related to this PR, please?

sergiordlr commented 2 months ago

Verification of the internal-regsitry-pull-secret file

Passed: "[sig-mco] MCO Layering Author:sregidor-ConnectedOnly-Longduration-NonPreRelease-Medium-54056-Update osImage using the internal registry to store the image [Disruptive] [Serial]"

The file is not present in the rendered machine-configs

$ oc get mc -oyaml rendered-worker-de597a802ef5cd27f6c53083f93a89c3 | grep "path:" | grep internal
$ oc get mc -oyaml rendered-master-13a425be563ebc405e0550d40f658ee3  | grep "path:" | grep internal

The file exists in the nodes and has the same content stored in .spec.internalRegistryPullSecret in the controllerconfig resource.

No issues found.

cheesesashimi commented 2 months ago

Answers inline:

When we create a MOSC without buildOutputs secret mchine-config CO is degrade with the wrong message

I feel like this shouldn't be possible given API validation, though evidently, it is possible. To prevent this, we can check for the presence of all of the secrets referenced by the MachineOSConfig and emit an error if an expected secret is not present. I feel like such a thing would also be covered by API validation, but I don't know for sure.

Events seem to show a wrong message, bad format

That feels like a regression though I don't think it was caused by this PR. I will take a look though since the fix should be pretty simple.

When 2 machineosbuilds are created, and we remove the MSOC, not all machineosbuilds are garbage-collected

I don't think we have a good story around garbage collection right now. That said, I suspect the root cause here is that the machine-os-builder pod might be shutting down before the second deletion can take place. I don't think that was introduced in this PR, however.

MCP are stuck doing nothing after the OCL image is built. The time is more or less random, it can take up to 27 minutes to start applying the new image.

Question: When the MCP is stuck in this state, what are the MCDs doing? Specifically, what I'm asking is: Is the DaemonSet rolling out a new MCD? Or are they all up and running? Additionally, do they all have the secret volume mounts for the secrets present in the MachineOSConfigs? This will help clarify whether this PR is what is causing this or if the root cause lies elsewhere.

sergiordlr commented 2 months ago

@cheesesashimi I have taken a must-gather file while the cluster is stuck after building the image You can find the must gather file here: https://issues.redhat.com/browse/OCPBUGS-34261?focusedId=24922586&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24922586

cheesesashimi commented 2 months ago

For posterity, I looked at the must-gather provided by @sergiordlr and also looked at the cluster he was running everything on. I observed a few interesting things, but cannot definitively determine whether this PR introduces this rollout delay. Here's why:

When the MCO updates the DaemonSet object for the MachineConfigDaemon, all of the pods restart in order to pick up the new secrets. This occurs while the build is running, but a subset of the MCD pods fail the health / liveness check. I was unable to see logs indicating what was happening inside those MCD pods in order to determine whether it was related to the changes this PR makes.
The NodeController does not begin to set annotations on the nodes to roll out the new config for quite some time after the build has completed. Meanwhile, just updating the MCD DaemonSet object should not cause the NodeController to not select a given node for updates.

Theoretically, what could happen is that the MCD pod could fail to start if the secret it is configured to use does not actually exist (unless one sets optional: true on the volume config, which we are not doing). Indeed, I saw that specific scenario a few times while I was trying to write the e2e test for this. In that scenario, replacement MCD pods would be blocked from rolling out given the default rollout strategy we're using for them. Similarly, if one of the MCD pods is configured to block SIGTERMS (like it does whenever an update is in process), that could block the rollout of replacement MCD pods since the DaemonSet controller will not be able to terminate that pod via a SIGTERM.

The following solutions exist for both of those scenarios:

I've added a validator to that path to ensure that those secrets exist before modifying the DaemonSet object.
We can update the rollout strategy to tolerate more than one MCD pod being unavailable at a given time. I'd like to add that only if we've validated that we've run into issues with SIGTERMs. To better inform whether we're running into this specific scenario, I've added some code to emit events whenever we enable or disable SIGTERM protection as well as when attempted termination of the MCD pod occurs while SIGTERM protection is active.

cheesesashimi commented 2 months ago

Update: I was able to reproduce the MCD health / liveness check in my sandbox cluster. While this occurred, I was able to stream the pod logs using k9s. I observed that the reason why the checks were temporarily failing is because we are starting up the health endpoint fairly late in the MCD startup process. At this point, I can safely rule that out as being the root cause of the delayed image rollout once the image is built.

sergiordlr commented 2 months ago

Verified using IPI on GCP

Now the secrets are reporting a descriptive error when they are not properly configured
Now the events are displaying a properly formatted message

$ oc get events |grep SetDesiredConfig
75m         Normal    SetDesiredConfigAndOSImage                 machineconfigpool/worker                                          Targeted node sregidor-voclsec6-knd2c-worker-a-m2krj to MachineConfig: rendered-worker-269adf05864ad9d63d7954781af64f0b / Image: image-registry.openshift-image-registry.svc:5000/mco-tc-73947/ocb-image@sha256:1349bbff7e3e4a2e3314ae98af7b96a968c0aa782adee746998bf828e9204f13

$ oc get events |grep SetDesiredConfigAndOSImage 
75m         Normal    SetDesiredConfigAndOSImage                 machineconfigpool/worker                                          Targeted node sregidor-voclsec6-knd2c-worker-a-m2krj to MachineConfig: rendered-worker-269adf05864ad9d63d7954781af64f0b / Image: image-registry.openshift-image-registry.svc:5000/mco-tc-73947/ocb-image@sha256:1349bbff7e3e4a2e3314ae98af7b96a968c0aa782adee746998bf828e9204f13

Regarding the time gap between MCO builds the image and MCO starts applying the image, we have created a new jira ticket to track this issue https://issues.redhat.com/browse/OCPBUGS-35509
Regarding the machineosbuild resources that are not garbage collected, we have created a new jira ticket to track this behaviour https://issues.redhat.com/browse/OCPBUGS-35512

We have verified that with this fix OCL functionality can properly use the buildOutputs.currentImagePullSecret secret. Hence, we add the qe-approved label. The issues mentioned in this PR will be fixed and tracked using the jira tickets that we have created for that.

/label qe-approved

openshift-ci-robot commented 2 months ago

@cheesesashimi: This pull request references Jira Issue OCPBUGS-33913, which is valid.

3 validation(s) were run on this bug

* bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

This pull request references Jira Issue OCPBUGS-34261, which is valid.

3 validation(s) were run on this bug

* bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

In response to [this](https://github.com/openshift/machine-config-operator/pull/4395): >**- What I did** > >The CustomImagePullSecret that is referred to by a MachineOSConfig was not being used. This means that pulling an OS image from a private image registry will not work. This work does the following: > >1. Migrates functions for changing the formatting of pull secrets from the BuildController package into controller/common, allowing reuse. >2. Adds secret volume mounts to the MachineConfigDaemon DaemonSet as well as template rendering logic to collect that information whenever a MachineOSConfig is changed. >3. Adds an MCD-side merge process that reads the secrets from the volume mounts, reads the secrets provided by the ControllerConfig, and then merges them into /etc/mco/internal-image-pull-secret.json. >4. Adds unit and e2e tests to validate that a secret referenced by a MachineOSConfig is written to the nodes' filesystem, overriding any values from the ControllerConfig. > >**- How to verify it** > >1. Bring up a cluster with the TechPreviewNoUpgrade feature-gate added. >2. Set up a MachineOSConfig to use a private image registry for both pushing and pulling. >3. Create a MachineConfigPool and a MachineOSConfig and wait for the build to complete. While this is occurring the MCD DaemonSet will restart since it will now have a secret mounted into it. If multiple MachineOSConfigs are present, multiple secret volumes will be mounted into the MCD pod, even if they are the same secret. >4. Move a node into the new MachineConfigPool and wait for it to drain and reboot. >5. When the node comes back up, it should be booted into the OS image that was pulled from the private image registry. > >**- Description for the changelog** >CurrentImagePullSecret should be used by MachineConfigDaemon > > Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

cheesesashimi commented 2 months ago

Regarding the update latency bug (OCPBUGS-35509): This PR does not introduce that problem. Instead, it is caused by the lack of informers within NodeController for both the MachineOSConfig and MachineOSBuild objects. See bug for further discussion.

cheesesashimi commented 2 months ago

/test test-unit

openshift-ci[bot] commented 2 months ago

@cheesesashimi: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

/test 4.12-upgrade-from-stable-4.11-images
/test cluster-bootimages
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op
/test e2e-gcp-op-single-node
/test e2e-hypershift
/test images
/test unit
/test verify

The following commands are available to trigger optional jobs:

/test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
/test bootstrap-unit
/test e2e-aws-disruptive
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op
/test e2e-aws-ovn-upgrade-out-of-change
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-single-node
/test e2e-aws-upgrade-single-node
/test e2e-aws-workers-rhel8
/test e2e-azure
/test e2e-azure-ovn-upgrade
/test e2e-azure-ovn-upgrade-out-of-change
/test e2e-azure-upgrade
/test e2e-gcp-op-techpreview
/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-rt
/test e2e-gcp-rt-op
/test e2e-gcp-single-node
/test e2e-gcp-upgrade
/test e2e-metal-assisted
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack
/test e2e-openstack-dualstack
/test e2e-openstack-externallb
/test e2e-openstack-parallel
/test e2e-ovirt
/test e2e-ovirt-upgrade
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-ovn-upi
/test e2e-vsphere-ovn-upi-zones
/test e2e-vsphere-ovn-zones
/test e2e-vsphere-upgrade
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-e2e-vsphere
/test okd-images
/test okd-scos-images
/test security

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-config-operator-master-bootstrap-unit
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade-out-of-change
pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-techpreview
pull-ci-openshift-machine-config-operator-master-e2e-hypershift
pull-ci-openshift-machine-config-operator-master-e2e-vsphere-ovn-upi
pull-ci-openshift-machine-config-operator-master-e2e-vsphere-ovn-upi-zones
pull-ci-openshift-machine-config-operator-master-e2e-vsphere-ovn-zones
pull-ci-openshift-machine-config-operator-master-images
pull-ci-openshift-machine-config-operator-master-security
pull-ci-openshift-machine-config-operator-master-unit
pull-ci-openshift-machine-config-operator-master-verify

In response to [this](https://github.com/openshift/machine-config-operator/pull/4395#issuecomment-2183177390): >/test test-unit Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

cheesesashimi commented 2 months ago

/test unit

yuqi-zhang commented 2 months ago

/lgtm

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/machine-config-operator/blob/master/OWNERS)~~ [cheesesashimi,yuqi-zhang] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

cheesesashimi commented 2 months ago

/unhold

cheesesashimi commented 2 months ago

/retest-required /test e2e-vsphere-ovn-upi-zones /test e2e-azure-ovn-upgrade-out-of-change

openshift-ci-robot commented 2 months ago

/retest-required

Remaining retests: 0 against base HEAD 2e1b6bc163c3cfb6a0261b26d20e9897823ba156 and 2 for PR HEAD 521d36a9145296ef46c1419382871816b8d29301 in total

openshift-ci-robot commented 2 months ago

@cheesesashimi: Jira Issue OCPBUGS-33913: All pull requests linked via external trackers have merged:

openshift/machine-config-operator#4395

Jira Issue OCPBUGS-33913 has been moved to the MODIFIED state.

Jira Issue OCPBUGS-34261: All pull requests linked via external trackers have merged:

openshift/machine-config-operator#4395

Jira Issue OCPBUGS-34261 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4395): >**- What I did** > >The CustomImagePullSecret that is referred to by a MachineOSConfig was not being used. This means that pulling an OS image from a private image registry will not work. This work does the following: > >1. Migrates functions for changing the formatting of pull secrets from the BuildController package into controller/common, allowing reuse. >2. Adds secret volume mounts to the MachineConfigDaemon DaemonSet as well as template rendering logic to collect that information whenever a MachineOSConfig is changed. >3. Adds an MCD-side merge process that reads the secrets from the volume mounts, reads the secrets provided by the ControllerConfig, and then merges them into /etc/mco/internal-image-pull-secret.json. >4. Adds unit and e2e tests to validate that a secret referenced by a MachineOSConfig is written to the nodes' filesystem, overriding any values from the ControllerConfig. > >**- How to verify it** > >1. Bring up a cluster with the TechPreviewNoUpgrade feature-gate added. >2. Set up a MachineOSConfig to use a private image registry for both pushing and pulling. >3. Create a MachineConfigPool and a MachineOSConfig and wait for the build to complete. While this is occurring the MCD DaemonSet will restart since it will now have a secret mounted into it. If multiple MachineOSConfigs are present, multiple secret volumes will be mounted into the MCD pod, even if they are the same secret. >4. Move a node into the new MachineConfigPool and wait for it to drain and reboot. >5. When the node comes back up, it should be booted into the OS image that was pulled from the private image registry. > >**- Description for the changelog** >CurrentImagePullSecret should be used by MachineConfigDaemon > > Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-bot commented 2 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.17.0-202406250241.p0.g58aca69.assembly.stream.el9 for distgit ose-machine-config-operator. All builds following this will include this PR.

cheesesashimi commented 2 months ago

/cherry-pick release-4.16

openshift-cherrypick-robot commented 2 months ago

@cheesesashimi: #4395 failed to apply on top of branch "release-4.16":

Applying: certificatewriter should handle the internal registry pull secret
Using index info to reconstruct a base tree...
M   pkg/controller/template/template_controller.go
M   pkg/daemon/update.go
M   pkg/operator/sync.go
M   test/e2e-techpreview/helpers_test.go
M   test/helpers/utils.go
Falling back to patching base and 3-way merge...
Auto-merging test/helpers/utils.go
CONFLICT (content): Merge conflict in test/helpers/utils.go
Auto-merging test/e2e-techpreview/helpers_test.go
Auto-merging pkg/operator/sync.go
Auto-merging pkg/daemon/update.go
Auto-merging pkg/controller/template/template_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 certificatewriter should handle the internal registry pull secret
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to [this](https://github.com/openshift/machine-config-operator/pull/4395#issuecomment-2189003143): >/cherry-pick release-4.16 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

openshift / machine-config-operator

OCPBUGS-33913, OCPBUGS-34261: CurrentImagePullSecret should be consumed by the MCD #4395