MCO-1144: Implement progressing/error boot images conditions on the MachineConfiguration object

djoshy commented 3 months ago

What I did

Moved feature gate detection outside the machine set boot image controller(MSBIC). The MSBIC will now only run when in tech preview, similar to the pinned image set controller.
Removed the workqueue based system in the MSBIC, this helped with streamlining how the new conditions can be aggregated. This also removes the old metric based alert system.
The MSBIC will now post two conditions: MachineConfigurationBootImageUpdateProgressing and MachineConfigurationBootImageUpdateDegraded, along with a message detailing how many machine resources are in that respective state.
The operator degrades if the MSBIC posts an MachineConfigurationBootImageUpdateDegraded=True.
Added an e2e-techpreview test to verify that an misconfigured machineset causes a degrade on the operator

How to test

Bring up a cluster in tech preview. Check the the conditions on the MachineConfiguration/cluster object, there will be no conditions as no machinesets have been opted-in
```
$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}'
[]
```

Apply a MachineConfigurion object that will opt-in some or all machinesets. In this case, I am applying one that opts in all machinesets:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
name: cluster
namespace: openshift-machine-config-operator
spec:
logLevel: Normal
operatorLogLevel: Normal
managedBootImages:
machineManagers:
  - resource: machinesets
    apiGroup: machine.openshift.io
    selection:
      mode: All

The conditions should update to reflect this behavior:

$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}'
[
[
    {
        "lastTransitionTime": "2024-06-05T20:03:34Z",
        "message": "Reconciled 4 of 4 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments",
        "reason": "MAPIMachinesetUpdated",
        "status": "False",
        "type": "BootImageUpdateProgressing"
    },
    {
        "lastTransitionTime": "2024-06-05T20:03:25Z",
        "message": "0 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments",
        "reason": "MAPIMachinesetUpdated",
        "status": "False",
        "type": "BootImageUpdateDegraded"
    }
]
]

Add an owner reference or something else to cause an error on a machineset. The owner reference can be anything, the content of it does not matter - as long as it exists. In this case, I added a dummy ownerReference such as this to one of the machinesets

$ oc get machinesets.machine.openshift.io/ci-ln-yqjiw12-72292-286vv-worker-a -n openshift-machine-api -o jsonpath-as-json='{.metadata.ownerReferences}'
[
[
    {
        "apiVersion": "test",
        "kind": "test",
        "name": "test",
        "uid": "test"
    }
]
]

This will cause a degrade that should be visible in the conditions of the MachineConfiguration object and MCO's cluster operator object:

$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}'
[
[
    {
        "lastTransitionTime": "2024-06-05T20:08:14Z",
        "message": "Reconciled 3 of 4 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments",
        "reason": "MAPIMachinesetUpdated",
        "status": "False",
        "type": "BootImageUpdateProgressing"
    },
    {
        "lastTransitionTime": "2024-06-05T20:08:15Z",
        "message": "1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet ci-ln-yqjiw12-72292-286vv-worker-a: unexpected OwnerReference: test/test. Please remove this machineset from boot image management to avoid errors",
        "reason": "MAPIMachinesetUpdated",
        "status": "True",
        "type": "BootImageUpdateDegraded"
    }
]
]
$ oc get co/machine-config
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.17.0-0.test-2024-06-05-170915-ci-ln-yqjiw12-latest   True        False         True       161m    Failed to resync 4.17.0-0.test-2024-06-05-170915-ci-ln-yqjiw12-latest because: bootimage update failed: 1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet ci-ln-yqjiw12-72292-286vv-worker-a: unexpected OwnerReference: test/test. Please remove this machineset from boot image management to avoid errors

Removing the owner reference should remove the degrade on the CO object and update the MachineConfiguration conditions back to normal. Other errors should cause the same result, the ownerReferences path is just the fastest way to do it without affecting the cluster too much. For example, you could also:

Scale down the CVO and delete/mangle the coreos-bootimages configmap, so that the controller cannot find the update boot images anymore. In such a case, the controller will timeout after being unable to fetch and verify the configmap after 15 minutes.
Mangle the machineset so it is an unexpected format, causing an error while unmarshalling.

All e2e's should pass, including the new TestBootImageDegradeCondition.

openshift-ci-robot commented 3 months ago

@djoshy: This pull request references MCO-1144 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4390): > Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 3 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

djoshy commented 3 months ago

/test e2e-gcp-op /test e2e-gcp-op-techpreview /test unit /test verify

openshift-ci-robot commented 3 months ago

@djoshy: This pull request references MCO-1144 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4390): >**What I did** > >- Moved feature gate detection outside the machine set boot image controller(MSBIC). The MSBIC will now only run when in tech preview, similar to the pinned image set controller. >- Removed the workqueue based system in the MSBIC, this helped with streamlining how the new conditions can be aggregated. This also removes the old metric based alert system. >- The MSBIC will now post two conditions: `MachineConfigurationBootImageUpdateProgressing` and `MachineConfigurationBootImageUpdateDegraded`, along with a message detailing how many machine resources are in that respective state. >- The operator degrades if the MSBIC posts an `MachineConfigurationBootImageUpdateDegraded=True`. >- Added an e2e-techpreview test to verify that an misconfigured machineset causes a degrade on the operator > >**How to test** > >- Bring up a cluster in tech preview. Check the the conditions on the MachineConfiguration/cluster object, there will be no conditions as no machinesets have been opted-in >``` >$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}' >[] >``` >- Apply a `MachineConfigurion` object that will opt-in some or all machinesets. In this case, I am applying one that opts in all machinesets: >``` >apiVersion: operator.openshift.io/v1 >kind: MachineConfiguration >metadata: > name: cluster > namespace: openshift-machine-config-operator >spec: > logLevel: Normal > operatorLogLevel: Normal > managedBootImages: > machineManagers: > - resource: machinesets > apiGroup: machine.openshift.io > selection: > mode: All >``` >The conditions should update to reflect this behavior: >``` >$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}' >[ > [ > { > "lastTransitionTime": "2024-06-05T20:03:34Z", > "message": "Reconciled 4 of 4 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments", > "reason": "MAPIMachinesetUpdated", > "status": "False", > "type": "BootImageUpdateProgressing" > }, > { > "lastTransitionTime": "2024-06-05T20:03:25Z", > "message": "0 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): \u003cnil\u003e", > "reason": "MAPIMachinesetUpdated", > "status": "False", > "type": "BootImageUpdateDegraded" > } > ] >] >``` >- Add an owner reference or something else to cause an error on a machineset. The owner reference can be anything, the content of it does not matter - as long as it exists. In this case, I added a dummy ownerReference such as this to one of the machinesets >``` >$ oc get machinesets.machine.openshift.io/ci-ln-yqjiw12-72292-286vv-worker-a -n openshift-machine-api -o jsonpath-as-json='{.metadata.ownerReferences}' >[ > [ > { > "apiVersion": "test", > "kind": "test", > "name": "test", > "uid": "test" > } > ] >] >``` >This will cause a degrade that should be visible in the conditions of the MachineConfiguration object and MCO's cluster operator object: >``` >$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}' >[ > [ > { > "lastTransitionTime": "2024-06-05T20:08:14Z", > "message": "Reconciled 3 of 4 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments", > "reason": "MAPIMachinesetUpdated", > "status": "False", > "type": "BootImageUpdateProgressing" > }, > { > "lastTransitionTime": "2024-06-05T20:08:15Z", > "message": "1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet ci-ln-yqjiw12-72292-286vv-worker-a: unexpected OwnerReference: test/test. Please remove this machineset from boot image management to avoid errors", > "reason": "MAPIMachinesetUpdated", > "status": "True", > "type": "BootImageUpdateDegraded" > } > ] >] >$ oc get co/machine-config >NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE >machine-config 4.17.0-0.test-2024-06-05-170915-ci-ln-yqjiw12-latest True False True 161m Failed to resync 4.17.0-0.test-2024-06-05-170915-ci-ln-yqjiw12-latest because: bootimage update failed: 1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet ci-ln-yqjiw12-72292-286vv-worker-a: unexpected OwnerReference: test/test. Please remove this machineset from boot image management to avoid errors >``` >Removing the owner reference should remove the degrade on the CO object and update the MachineConfiguration conditions back to normal. Other errors should cause the same result, the ownerReferences path is just the fastest way to do it without affecting the cluster too much. For example, you could also: >1. Scale down the CVO and delete/mangle the coreos-bootimages configmap, so that the controller cannot find the update boot images anymore. In such a case, the controller will timeout after being unable to fetch and verify the configmap after 15 minutes. >2. Mangle the machineset so it is an unexpected format, causing an error while unmarshalling. > >- All e2e's should pass, including the new `TestBootImageDegradeCondition`. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

djoshy commented 3 months ago

/retest-required

djoshy commented 2 months ago

Rebased on master and fixed @yuqi-zhang's nits. Should be ready for a look (:

sergiordlr commented 2 months ago

/hold

wait for qe-approved

sergiordlr commented 2 months ago

Verification executed using IPI on GCP

Given that we update the boot images update for all machinesets

$ oc patch machineconfiguration cluster --type merge -p '{"spec":{"managedBootImages":{"machineManagers":[{"resource": "machinesets","apiGroup": "machine.openshift.io","selection": {"mode": "All"}}]}}}'

This string to report no errors is a bit weird: Error(s): <nil>, in json output Error(s): \u003cnil\u003e" . Is there a better way to display no errors? Would an empty string be more user friendly maybe?
When we create a new machineset and we configure an owner for this machineset an error is reported in the status. Nevertheless, if we remove the machineset (without fixing the error) the status remains the same reporting an error that does not exist anymore.

    {
            "lastTransitionTime": "2024-07-04T15:29:47Z",
            "message": "1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet cloned-tc-74240-owned: unexpected OwnerReferenc
e: fakekind/master. Please remove this machineset from boot image management to avoid errors",
            "reason": "MAPIMachinesetUpdated",
            "status": "True",
            "type": "BootImageUpdateDegraded"
        }

The counter for MAPI machinesets is not updated either after removing the machineset ( this counter: "Reconciled 5 of 5 MAPI MachineSets ).

        {
            "lastTransitionTime": "2024-07-04T17:01:40Z",
            "message": "Reconciled 5 of 5 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments",
            "reason": "MAPIMachinesetUpdated",
            "status": "False",
            "type": "BootImageUpdateProgressing"
        },

We hit a problem but I'm not sure if it has been introduced in this PR (I can check if you want to). When we execute these steps the machineset is never updated:
1. Create new machineset
2. Patch the machineset to set a wrong arch value capacity.cluster-autoscaler.kubernetes.io/labels: kubernetes.io/arch=amd64-fake
3. Patch the machineset to set a wrong boot image image: fake-image
4. An error is reported in the status because of the wrong architecture
5. Fix the architecture value

The result is:

The machineset is never updated (it shoud be updated since the error was fixed, right?)
The status is reporting an error, and it says that it is not progressing

Should BootImageUpdateProgressing be "true"?

        {
            "lastTransitionTime": "2024-07-05T14:24:35Z",
            "message": "Reconciled 4 of 5 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments",
            "reason": "MAPIMachinesetUpdated",
            "status": "False",
            "type": "BootImageUpdateProgressing"
        },
    {
            "lastTransitionTime": "2024-07-05T14:24:36Z",
            "message": "1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet sregidor-bimv-l75p2-worker-a-modified: failed t
o fetch arch during machineset sync: invalid architecture value found in annotation: kubernetes.io/arch=amd641 ",
            "reason": "MAPIMachinesetUpdated",
            "status": "True",
            "type": "BootImageUpdateDegraded"
        }

The MCCBootImageUpdateError alert is not triggered when there are problems updating the boot images. Reading the code it looks like intended, but I would like like to double check that it is intended.
Test cases passed:

"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74238-BootImages not updated by default [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74239-ManagedBootImages on GCP. Restore Partial MachineSet images [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74240-ManagedBootImages on GCP. Restore All MachineSet images [Disruptive] [Serial]"

djoshy commented 1 month ago

Thanks for the review @sergiordlr! Responses inline:

$ oc patch machineconfiguration cluster --type merge -p '{"spec":{"managedBootImages":{"machineManagers":[{"resource": "machinesets","apiGroup": "machine.openshift.io","selection": {"mode": "All"}}]}}}'

This string to report no errors is a bit weird: Error(s): <nil>, in json output Error(s): \u003cnil\u003e" . Is there a better way to display no errors? Would an empty string be more user friendly maybe?

Agreed, this could be neater, I've tried to clean it up for the no error case - let me know what you think!

When we create a new machineset and we configure an owner for this machineset an error is reported in the status. Nevertheless, if we remove the machineset (without fixing the error) the status remains the same reporting an error that does not exist anymore.

I think this is because the controller is not handling machineset deletes, I've added a new callback for this.

We hit a problem but I'm not sure if it has been introduced in this PR (I can check if you want to). When we execute these steps the machineset is never updated:

This is not removing the error because the controller is not checking annotations for changes, it only attempts an update when it can see the labels, ownerreferences, providerSpec of a machineset has changed. I have added annotations to the list of things that should be checked, so that should fix this!

The MCCBootImageUpdateError alert is not triggered when there are problems updating the boot images. Reading the code it looks like intended, but I would like like to double check that it is intended.

This is intended, as noted in the PR description, we are removing the alert.

Hmm, I guess it is a hassle to do a chain'ed if check to see if everything under here exists, but I'm just concerned that e.g. ms.Spec.Template.Spec exists, but ms.Spec.Template.Spec.ProviderSpec is empty and we panic here, which is not idea. @yuqi-zhang

Closing the loop on this, I took a look at how these validations are done in the machine api operator, and they seem to have webhooks setup so that an empty object in the chain of ms.Spec.Template.Spec.ProviderSpec isn't possible. I tested this out on my cluster too and they check out.

openshift-ci-robot commented 1 month ago

@djoshy: This pull request references MCO-1144 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4390): >**What I did** > >- Moved feature gate detection outside the machine set boot image controller(MSBIC). The MSBIC will now only run when in tech preview, similar to the pinned image set controller. >- Removed the workqueue based system in the MSBIC, this helped with streamlining how the new conditions can be aggregated. This also removes the old metric based alert system. >- The MSBIC will now post two conditions: `MachineConfigurationBootImageUpdateProgressing` and `MachineConfigurationBootImageUpdateDegraded`, along with a message detailing how many machine resources are in that respective state. >- The operator degrades if the MSBIC posts an `MachineConfigurationBootImageUpdateDegraded=True`. >- Added an e2e-techpreview test to verify that an misconfigured machineset causes a degrade on the operator > >**How to test** > >- Bring up a cluster in tech preview. Check the the conditions on the MachineConfiguration/cluster object, there will be no conditions as no machinesets have been opted-in >``` >$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}' >[] >``` >- Apply a `MachineConfigurion` object that will opt-in some or all machinesets. In this case, I am applying one that opts in all machinesets: >``` >apiVersion: operator.openshift.io/v1 >kind: MachineConfiguration >metadata: > name: cluster > namespace: openshift-machine-config-operator >spec: > logLevel: Normal > operatorLogLevel: Normal > managedBootImages: > machineManagers: > - resource: machinesets > apiGroup: machine.openshift.io > selection: > mode: All >``` >The conditions should update to reflect this behavior: >``` >$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}' >[ > [ > { > "lastTransitionTime": "2024-06-05T20:03:34Z", > "message": "Reconciled 4 of 4 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments", > "reason": "MAPIMachinesetUpdated", > "status": "False", > "type": "BootImageUpdateProgressing" > }, > { > "lastTransitionTime": "2024-06-05T20:03:25Z", > "message": "0 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments", > "reason": "MAPIMachinesetUpdated", > "status": "False", > "type": "BootImageUpdateDegraded" > } > ] >] >``` >- Add an owner reference or something else to cause an error on a machineset. The owner reference can be anything, the content of it does not matter - as long as it exists. In this case, I added a dummy ownerReference such as this to one of the machinesets >``` >$ oc get machinesets.machine.openshift.io/ci-ln-yqjiw12-72292-286vv-worker-a -n openshift-machine-api -o jsonpath-as-json='{.metadata.ownerReferences}' >[ > [ > { > "apiVersion": "test", > "kind": "test", > "name": "test", > "uid": "test" > } > ] >] >``` >This will cause a degrade that should be visible in the conditions of the MachineConfiguration object and MCO's cluster operator object: >``` >$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}' >[ > [ > { > "lastTransitionTime": "2024-06-05T20:08:14Z", > "message": "Reconciled 3 of 4 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments", > "reason": "MAPIMachinesetUpdated", > "status": "False", > "type": "BootImageUpdateProgressing" > }, > { > "lastTransitionTime": "2024-06-05T20:08:15Z", > "message": "1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet ci-ln-yqjiw12-72292-286vv-worker-a: unexpected OwnerReference: test/test. Please remove this machineset from boot image management to avoid errors", > "reason": "MAPIMachinesetUpdated", > "status": "True", > "type": "BootImageUpdateDegraded" > } > ] >] >$ oc get co/machine-config >NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE >machine-config 4.17.0-0.test-2024-06-05-170915-ci-ln-yqjiw12-latest True False True 161m Failed to resync 4.17.0-0.test-2024-06-05-170915-ci-ln-yqjiw12-latest because: bootimage update failed: 1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet ci-ln-yqjiw12-72292-286vv-worker-a: unexpected OwnerReference: test/test. Please remove this machineset from boot image management to avoid errors >``` >Removing the owner reference should remove the degrade on the CO object and update the MachineConfiguration conditions back to normal. Other errors should cause the same result, the ownerReferences path is just the fastest way to do it without affecting the cluster too much. For example, you could also: >1. Scale down the CVO and delete/mangle the coreos-bootimages configmap, so that the controller cannot find the update boot images anymore. In such a case, the controller will timeout after being unable to fetch and verify the configmap after 15 minutes. >2. Mangle the machineset so it is an unexpected format, causing an error while unmarshalling. > >- All e2e's should pass, including the new `TestBootImageDegradeCondition`. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

sergiordlr commented 1 month ago

Latest tests summary:

Sometimes machineconfiguration was reporting a degraded status and it was reporting progress=true forever. It has been fixed.
When removing the boot images CM (oc delete cm coreos-bootimages -n openshift-machine-config-operator) a panic happened in the controller.

djoshy commented 1 month ago

Thanks for the extensive testing @sergiordlr ! I have fixed the panic, I was casting the wrong type :cry:

djoshy commented 1 month ago

/test e2e-gcp-op-single-node

djoshy commented 1 month ago

/test e2e-gcp-op-single-node

openshift-ci[bot] commented 1 month ago

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-vsphere-ovn-zones	1dccac4adc211b7761654e02c5c5c28266f636ba	link	false	`/test e2e-vsphere-ovn-zones`
ci/prow/e2e-vsphere-ovn-upi-zones	1dccac4adc211b7761654e02c5c5c28266f636ba	link	false	`/test e2e-vsphere-ovn-upi-zones`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).

sergiordlr commented 1 month ago

Verified using IPI on GCP

The error message is user friendly now
When we fix the error in the architecture the machineset is properly updated now
When we delete a failing machineset the machineconfiguration conditions are updated accordingly
No panic happens when we remove the coreos-bootimages configmap
Progress and degraded status are reported consistently
Tests passed:

"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74238-BootImages not updated by default [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74239-ManagedBootImages on GCP. Restore Partial MachineSet images [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74240-ManagedBootImages on GCP. Restore All MachineSet images [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74751-ManagedBootImages on GCP. Fix errors [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74764-ManagedBootImages on GCP. Delete machineset when error [Disruptive] [Serial]"

/label qe-approved

openshift-ci-robot commented 1 month ago

@djoshy: This pull request references MCO-1144 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4390): >**What I did** > >- Moved feature gate detection outside the machine set boot image controller(MSBIC). The MSBIC will now only run when in tech preview, similar to the pinned image set controller. >- Removed the workqueue based system in the MSBIC, this helped with streamlining how the new conditions can be aggregated. This also removes the old metric based alert system. >- The MSBIC will now post two conditions: `MachineConfigurationBootImageUpdateProgressing` and `MachineConfigurationBootImageUpdateDegraded`, along with a message detailing how many machine resources are in that respective state. >- The operator degrades if the MSBIC posts an `MachineConfigurationBootImageUpdateDegraded=True`. >- Added an e2e-techpreview test to verify that an misconfigured machineset causes a degrade on the operator > >**How to test** > >- Bring up a cluster in tech preview. Check the the conditions on the MachineConfiguration/cluster object, there will be no conditions as no machinesets have been opted-in >``` >$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}' >[] >``` >- Apply a `MachineConfigurion` object that will opt-in some or all machinesets. In this case, I am applying one that opts in all machinesets: >``` >apiVersion: operator.openshift.io/v1 >kind: MachineConfiguration >metadata: > name: cluster > namespace: openshift-machine-config-operator >spec: > logLevel: Normal > operatorLogLevel: Normal > managedBootImages: > machineManagers: > - resource: machinesets > apiGroup: machine.openshift.io > selection: > mode: All >``` >The conditions should update to reflect this behavior: >``` >$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}' >[ > [ > { > "lastTransitionTime": "2024-06-05T20:03:34Z", > "message": "Reconciled 4 of 4 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments", > "reason": "MAPIMachinesetUpdated", > "status": "False", > "type": "BootImageUpdateProgressing" > }, > { > "lastTransitionTime": "2024-06-05T20:03:25Z", > "message": "0 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments", > "reason": "MAPIMachinesetUpdated", > "status": "False", > "type": "BootImageUpdateDegraded" > } > ] >] >``` >- Add an owner reference or something else to cause an error on a machineset. The owner reference can be anything, the content of it does not matter - as long as it exists. In this case, I added a dummy ownerReference such as this to one of the machinesets >``` >$ oc get machinesets.machine.openshift.io/ci-ln-yqjiw12-72292-286vv-worker-a -n openshift-machine-api -o jsonpath-as-json='{.metadata.ownerReferences}' >[ > [ > { > "apiVersion": "test", > "kind": "test", > "name": "test", > "uid": "test" > } > ] >] >``` >This will cause a degrade that should be visible in the conditions of the MachineConfiguration object and MCO's cluster operator object: >``` >$ oc get machineconfiguration/cluster -o jsonpath-as-json='{.status.conditions}' >[ > [ > { > "lastTransitionTime": "2024-06-05T20:08:14Z", > "message": "Reconciled 3 of 4 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments", > "reason": "MAPIMachinesetUpdated", > "status": "False", > "type": "BootImageUpdateProgressing" > }, > { > "lastTransitionTime": "2024-06-05T20:08:15Z", > "message": "1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet ci-ln-yqjiw12-72292-286vv-worker-a: unexpected OwnerReference: test/test. Please remove this machineset from boot image management to avoid errors", > "reason": "MAPIMachinesetUpdated", > "status": "True", > "type": "BootImageUpdateDegraded" > } > ] >] >$ oc get co/machine-config >NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE >machine-config 4.17.0-0.test-2024-06-05-170915-ci-ln-yqjiw12-latest True False True 161m Failed to resync 4.17.0-0.test-2024-06-05-170915-ci-ln-yqjiw12-latest because: bootimage update failed: 1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet ci-ln-yqjiw12-72292-286vv-worker-a: unexpected OwnerReference: test/test. Please remove this machineset from boot image management to avoid errors >``` >Removing the owner reference should remove the degrade on the CO object and update the MachineConfiguration conditions back to normal. Other errors should cause the same result, the ownerReferences path is just the fastest way to do it without affecting the cluster too much. For example, you could also: >1. Scale down the CVO and delete/mangle the coreos-bootimages configmap, so that the controller cannot find the update boot images anymore. In such a case, the controller will timeout after being unable to fetch and verify the configmap after 15 minutes. >2. Mangle the machineset so it is an unexpected format, causing an error while unmarshalling. > >- All e2e's should pass, including the new `TestBootImageDegradeCondition`. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

yuqi-zhang commented 1 month ago

reapplying

/lgtm

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/machine-config-operator/blob/master/OWNERS)~~ [djoshy,yuqi-zhang] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

djoshy commented 1 month ago

/unhold

djoshy commented 1 month ago

/label acknowledge-critical-fixes-only

As this is behind a feature gate, it fits the criteria for the label. We need this merged to start working on the AWS platform as well.

openshift-bot commented 1 month ago

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.17.0-202407132040.p0.g6968535.assembly.stream.el9 for distgit ose-machine-config-operator. All builds following this will include this PR.

openshift / machine-config-operator

MCO-1144: Implement progressing/error boot images conditions on the MachineConfiguration object #4390