siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.
Other
397 stars 23 forks source link

fix: fix config generation failure due to secure boot status #321

Closed utkuozdemir closed 3 weeks ago

utkuozdemir commented 4 weeks ago

A change in the SecureBootStatus in the MachineStatus did not cause a new config to be generated. This was caused by the ClusterMachineConfigController not taking the MachineStatus into the calculation of the hash from the inputs (UpdateInputsVersions) to determine if it should generate a new config or not.

Modify the existing MachineConfigGenOptions resource to hold the required information to build the install image. Instead of using MachineStatus+ClusterMachineTalosVersion to build the install image URL, use only MachineConfigGenOptions resource.

Additionally, add a migration to pre-populate the new fields in the MachineConfigGenOptions resources, so that they will not be modified on the next Omni upgrade. Also update the input versions of the ClusterMachineConfig to not contain ClusterMachineTalosVersion anymore, so that we won't trigger a config re-apply on the next Omni upgrade.

Unix4ever commented 4 weeks ago

And maybe we should add a unit-test in the controller that will verify that there's no bug anymore

utkuozdemir commented 4 weeks ago

LGTM.

But the fact that the machines running Talos 1.7.2 have no secure boot status information also looks like a bug.

They actually have the status set correctly on the machinestatus - the generation error we see is obsolete/outdated.

Unix4ever commented 4 weeks ago

They actually have the status set correctly on the machinestatus - the generation error we see is obsolete/outdated

But why it's Unknown? Should it be disabled instead?

I think there's some bug in the JS code. The response omits false fields by default, so that might lead to the incorrect results in the UI.

utkuozdemir commented 4 weeks ago

They actually have the status set correctly on the machinestatus - the generation error we see is obsolete/outdated

But why it's Unknown? Should it be disabled instead?

I think there's some bug in the JS code.

Makes sense.

So, the MachineStatus seems completely right:

    securebootstatus:
        enabled: false

But the ClusterMachineConfig is not aware of it.

I think I just found another issue here - here, for the obsolete error to remain, the change in MachineStatus (the change that sets the securebootstatus to be not-nil and enabled: false) should not trigger another generation of config. I think it also happened here, because before generating a new config, we calculate the input versions, and if there is no change in the versions, we just skip it. And while doing it, it seems we don't take MachineStatus into consideration.

Question is, if I add it to the input versions calculation, will it trigger too many config generations?

Unix4ever commented 4 weeks ago

Question is, if I add it to the input versions calculation, will it trigger too many config generations?

I guess it will, but every update with the same state of the secure boot will be no-op. I would actually prefer to get rid of the MachineStatus in this controller. It would be better to split secure boot and schematic parts to a separate resource.

utkuozdemir commented 4 weeks ago

Question is, if I add it to the input versions calculation, will it trigger too many config generations?

I guess it will, but every update with the same state of the secure boot will be no-op. I would actually prefer to get rid of the MachineStatus in this controller. It would be better to split secure boot and schematic parts to a separate resource.

Yep, I got a similar idea now - we copy over the secure boot status into ClusterMachineTalosVersion, just like we do with the schamatic ID. And here we read it from that instead.

utkuozdemir commented 4 weeks ago

@Unix4ever I ended up using ClusterMachineTalosVersion for now, as it already has a working flow. Avoided adding a new resource into the mix atm which potentially cause some other parts to break.

I think we can pair and do a bigger rework later to simplify the whole machine customization flow.

Unix4ever commented 4 weeks ago

We might need to do migration here to avoid triggering the config apply.

And if we rework this later we'll have to do another migration.

As both changes are no-op but change the inputs version, the config will get regenerated in the controllers if we don't do anything.

utkuozdemir commented 4 weeks ago

We might need to do migration here to avoid triggering the config apply.

And if we rework this later we'll have to do another migration.

As both changes are no-op but change the inputs version, the config will get regenerated in the controllers if we don't do anything.

Done as discussed, including the migration. Please have another look so we don't miss anything.

utkuozdemir commented 3 weeks ago

@Unix4ever ended up adding it into MachineConfigGenOptions which simplified things a lot.

utkuozdemir commented 3 weeks ago

/m

utkuozdemir commented 3 weeks ago

/m