Closed djoshy closed 1 month ago
@djoshy: This pull request references MCO-1144 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
/test e2e-gcp-op /test e2e-gcp-op-techpreview /test unit /test verify
@djoshy: This pull request references MCO-1144 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.
/retest-required
Rebased on master and fixed @yuqi-zhang's nits. Should be ready for a look (:
/hold
wait for qe-approved
Verification executed using IPI on GCP
Given that we update the boot images update for all machinesets
$ oc patch machineconfiguration cluster --type merge -p '{"spec":{"managedBootImages":{"machineManagers":[{"resource": "machinesets","apiGroup": "machine.openshift.io","selection": {"mode": "All"}}]}}}'
This string to report no errors is a bit weird: Error(s): <nil>
, in json output Error(s): \u003cnil\u003e"
. Is there a better way to display no errors? Would an empty string be more user friendly maybe?
When we create a new machineset and we configure an owner for this machineset an error is reported in the status. Nevertheless, if we remove the machineset (without fixing the error) the status remains the same reporting an error that does not exist anymore.
{
"lastTransitionTime": "2024-07-04T15:29:47Z",
"message": "1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet cloned-tc-74240-owned: unexpected OwnerReferenc
e: fakekind/master. Please remove this machineset from boot image management to avoid errors",
"reason": "MAPIMachinesetUpdated",
"status": "True",
"type": "BootImageUpdateDegraded"
}
The counter for MAPI machinesets is not updated either after removing the machineset ( this counter: "Reconciled 5 of 5 MAPI MachineSets ).
{
"lastTransitionTime": "2024-07-04T17:01:40Z",
"message": "Reconciled 5 of 5 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments",
"reason": "MAPIMachinesetUpdated",
"status": "False",
"type": "BootImageUpdateProgressing"
},
capacity.cluster-autoscaler.kubernetes.io/labels: kubernetes.io/arch=amd64-fake
image: fake-image
The result is:
Should BootImageUpdateProgressing be "true"?
{
"lastTransitionTime": "2024-07-05T14:24:35Z",
"message": "Reconciled 4 of 5 MAPI MachineSets | Reconciled 0 of 0 CAPI MachineSets | Reconciled 0 of 0 CAPI MachineDeployments",
"reason": "MAPIMachinesetUpdated",
"status": "False",
"type": "BootImageUpdateProgressing"
},
{
"lastTransitionTime": "2024-07-05T14:24:36Z",
"message": "1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments | Error(s): error syncing MAPI MachineSet sregidor-bimv-l75p2-worker-a-modified: failed t
o fetch arch during machineset sync: invalid architecture value found in annotation: kubernetes.io/arch=amd641 ",
"reason": "MAPIMachinesetUpdated",
"status": "True",
"type": "BootImageUpdateDegraded"
}
The MCCBootImageUpdateError alert is not triggered when there are problems updating the boot images. Reading the code it looks like intended, but I would like like to double check that it is intended.
Test cases passed:
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74238-BootImages not updated by default [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74239-ManagedBootImages on GCP. Restore Partial MachineSet images [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74240-ManagedBootImages on GCP. Restore All MachineSet images [Disruptive] [Serial]"
Thanks for the review @sergiordlr! Responses inline:
$ oc patch machineconfiguration cluster --type merge -p '{"spec":{"managedBootImages":{"machineManagers":[{"resource": "machinesets","apiGroup": "machine.openshift.io","selection": {"mode": "All"}}]}}}'
- This string to report no errors is a bit weird:
Error(s): <nil>
, in json outputError(s): \u003cnil\u003e"
. Is there a better way to display no errors? Would an empty string be more user friendly maybe?
Agreed, this could be neater, I've tried to clean it up for the no error case - let me know what you think!
- When we create a new machineset and we configure an owner for this machineset an error is reported in the status. Nevertheless, if we remove the machineset (without fixing the error) the status remains the same reporting an error that does not exist anymore.
I think this is because the controller is not handling machineset deletes, I've added a new callback for this.
- We hit a problem but I'm not sure if it has been introduced in this PR (I can check if you want to). When we execute these steps the machineset is never updated:
This is not removing the error because the controller is not checking annotations for changes, it only attempts an update when it can see the labels, ownerreferences, providerSpec of a machineset has changed. I have added annotations to the list of things that should be checked, so that should fix this!
- The MCCBootImageUpdateError alert is not triggered when there are problems updating the boot images. Reading the code it looks like intended, but I would like like to double check that it is intended.
This is intended, as noted in the PR description, we are removing the alert.
Hmm, I guess it is a hassle to do a chain'ed if check to see if everything under here exists, but I'm just concerned that e.g. ms.Spec.Template.Spec exists, but ms.Spec.Template.Spec.ProviderSpec is empty and we panic here, which is not idea. @yuqi-zhang
Closing the loop on this, I took a look at how these validations are done in the machine api operator, and they seem to have webhooks setup so that an empty object in the chain of ms.Spec.Template.Spec.ProviderSpec
isn't possible. I tested this out on my cluster too and they check out.
@djoshy: This pull request references MCO-1144 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.
Latest tests summary:
Thanks for the extensive testing @sergiordlr ! I have fixed the panic, I was casting the wrong type :cry:
/test e2e-gcp-op-single-node
/test e2e-gcp-op-single-node
@djoshy: The following tests failed, say /retest
to rerun all failed tests or /retest-required
to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
ci/prow/e2e-vsphere-ovn-zones | 1dccac4adc211b7761654e02c5c5c28266f636ba | link | false | /test e2e-vsphere-ovn-zones |
ci/prow/e2e-vsphere-ovn-upi-zones | 1dccac4adc211b7761654e02c5c5c28266f636ba | link | false | /test e2e-vsphere-ovn-upi-zones |
Full PR test history. Your PR dashboard.
Verified using IPI on GCP
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74238-BootImages not updated by default [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74239-ManagedBootImages on GCP. Restore Partial MachineSet images [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74240-ManagedBootImages on GCP. Restore All MachineSet images [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74751-ManagedBootImages on GCP. Fix errors [Disruptive] [Serial]"
"[sig-mco] MCO Bootimages Author:sregidor-NonHyperShiftHOST-NonPreRelease-Medium-74764-ManagedBootImages on GCP. Delete machineset when error [Disruptive] [Serial]"
/label qe-approved
@djoshy: This pull request references MCO-1144 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.
reapplying
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: djoshy, yuqi-zhang
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/unhold
/label acknowledge-critical-fixes-only
As this is behind a feature gate, it fits the criteria for the label. We need this merged to start working on the AWS platform as well.
[ART PR BUILD NOTIFIER]
This PR has been included in build ose-machine-config-operator-container-v4.17.0-202407132040.p0.g6968535.assembly.stream.el9 for distgit ose-machine-config-operator. All builds following this will include this PR.
What I did
MachineConfigurationBootImageUpdateProgressing
andMachineConfigurationBootImageUpdateDegraded
, along with a message detailing how many machine resources are in that respective state.MachineConfigurationBootImageUpdateDegraded=True
.How to test
Bring up a cluster in tech preview. Check the the conditions on the MachineConfiguration/cluster object, there will be no conditions as no machinesets have been opted-in
Apply a
MachineConfigurion
object that will opt-in some or all machinesets. In this case, I am applying one that opts in all machinesets:The conditions should update to reflect this behavior:
Add an owner reference or something else to cause an error on a machineset. The owner reference can be anything, the content of it does not matter - as long as it exists. In this case, I added a dummy ownerReference such as this to one of the machinesets
This will cause a degrade that should be visible in the conditions of the MachineConfiguration object and MCO's cluster operator object:
Removing the owner reference should remove the degrade on the CO object and update the MachineConfiguration conditions back to normal. Other errors should cause the same result, the ownerReferences path is just the fastest way to do it without affecting the cluster too much. For example, you could also:
All e2e's should pass, including the new
TestBootImageDegradeCondition
.