kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Support in-place updates to Infrastructure Machines specs #10629

Open muraee opened 3 months ago

muraee commented 3 months ago

What would you like to be added (User Story)?

As a user, I would like my changes to platform-supported mutable fields in the InfrastructureMachineTemplate to be propagated in-place, without causing a rollout.

Detailed Description

Problem:

For example, the AWSMachine supports day-2 modification of the .spec.additionalTags field, which is then reconciled and applied to the existing machines (EC2 instances) in-place.

However, to update .spec.additionalTags when using a MachineDeployment or MachineSet with an AWSMachineTemplate as the InfrastructureRef, a new AWSMachineTemplate with the new tags has to be created, which triggers a complete rollout instead of updating the existing machines in-place.
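
To make the problem concrete, here is a rough sketch of the workaround needed today just to change tags (using controller-runtime with unstructured objects; the v1beta2 API version and the function name are assumptions): clone the template under a new name and repoint the MachineDeployment, and the repointed ref is exactly what triggers the full rollout.

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rotateTemplateForNewTags is a hypothetical helper illustrating today's UX:
// the template is effectively immutable for rollout purposes, so changing
// additionalTags means cloning it and repointing the MachineDeployment at the
// clone, which rolls out brand new Machines.
func rotateTemplateForNewTags(ctx context.Context, c client.Client, md *clusterv1.MachineDeployment, newTags map[string]string) error {
	// Fetch the currently referenced AWSMachineTemplate.
	tpl := &unstructured.Unstructured{}
	tpl.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "infrastructure.cluster.x-k8s.io",
		Version: "v1beta2", // assumed; adjust to the CAPA version in use
		Kind:    "AWSMachineTemplate",
	})
	ref := md.Spec.Template.Spec.InfrastructureRef
	if err := c.Get(ctx, client.ObjectKey{Namespace: md.Namespace, Name: ref.Name}, tpl); err != nil {
		return err
	}

	// Clone it under a new name with the updated tags.
	clone := tpl.DeepCopy()
	clone.SetName(tpl.GetName() + "-v2")
	clone.SetResourceVersion("")
	clone.SetUID("")
	tags := map[string]interface{}{}
	for k, v := range newTags {
		tags[k] = v
	}
	if err := unstructured.SetNestedMap(clone.Object, tags, "spec", "template", "spec", "additionalTags"); err != nil {
		return err
	}
	if err := c.Create(ctx, clone); err != nil {
		return err
	}

	// Repointing the MachineDeployment at the clone is what triggers the rollout.
	md.Spec.Template.Spec.InfrastructureRef.Name = clone.GetName()
	return c.Update(ctx, md)
}
```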

Suggested solution:

MachineSet

https://github.com/kubernetes-sigs/cluster-api/pull/8060 added support for in-place propagation of labels and annotations to the infraMachines. We could build on top of that work to also fetch the spec of the referenced template within the syncMachines() func and propagate it to the existing infraMachines alongside annotations and labels, so that updating the referenced InfraMachineTemplate in-place would just work.
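
A rough sketch of what that could look like, assuming a free-standing helper called from syncMachines() (the helper name, the hard-coded list of mutable fields, and the use of unstructured objects are all illustrative; the real controller code is structured differently):

```go
package machineset

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// syncInfraMachineSpecs copies the in-place-mutable parts of the referenced
// InfraMachineTemplate spec onto every existing InfraMachine, so the infra
// provider can reconcile the change (e.g. EC2 tags) without a rollout.
func syncInfraMachineSpecs(ctx context.Context, c client.Client, ms *clusterv1.MachineSet, infraMachines []*unstructured.Unstructured) error {
	// Fetch the referenced InfraMachineTemplate (e.g. an AWSMachineTemplate).
	ref := ms.Spec.Template.Spec.InfrastructureRef
	tpl := &unstructured.Unstructured{}
	tpl.SetGroupVersionKind(ref.GroupVersionKind())
	if err := c.Get(ctx, client.ObjectKey{Namespace: ms.Namespace, Name: ref.Name}, tpl); err != nil {
		return err
	}

	// Only fields the provider supports mutating in-place should be copied
	// (hard-coded here for illustration). Overwriting the whole spec would
	// clobber provider-managed fields such as providerID.
	mutableFields := []string{"additionalTags"}

	for _, im := range infraMachines {
		patchBase := im.DeepCopy()
		for _, f := range mutableFields {
			val, found, err := unstructured.NestedFieldCopy(tpl.Object, "spec", "template", "spec", f)
			if err != nil || !found {
				continue
			}
			if err := unstructured.SetNestedField(im.Object, val, "spec", f); err != nil {
				return err
			}
		}
		if err := c.Patch(ctx, im, client.MergeFrom(patchBase)); err != nil {
			return err
		}
	}
	return nil
}
```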

MachineDeployment

The current UX for triggering upgrades relies on users changing the infraMachineTemplate reference name in the MachineDeployment. We could preserve that UX for in-place upgrades as well, by changing the way the MachineDeployment controller determines when a rollout is required, as follows:

  1. The user would set an annotation/label on the InfraMachineTemplate that contains a hash of its spec.
  2. The MachineDeployment controller's definition of "semantic equality" should exclude the InfraMachineTemplate ref and include the hash instead, i.e. a rollout is only triggered if the hash of the new InfraMachineTemplate is different from the current template's hash.
  3. [Backward compatibility] If the hash is not found, keep the current "semantic equality" behavior.

As the hash is created and set by the user, they can exclude any mutable field from the hash calculation, e.g. two AWSMachineTemplates that differ only in .spec.additionalTags would have the same hash, so no rollout is triggered and the change is instead propagated in-place.
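
To make the hash idea concrete, a minimal sketch of how a user (or their tooling) could compute and set such a hash, assuming a hypothetical annotation key and that only .spec.additionalTags is excluded from the calculation:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// Hypothetical annotation key; not an agreed-upon convention.
const templateHashAnnotation = "example.cluster.x-k8s.io/template-spec-hash"

// setTemplateSpecHash hashes the template's spec with in-place-mutable fields
// stripped out, so that two templates differing only in those fields carry the
// same hash and therefore would not trigger a rollout.
func setTemplateSpecHash(tpl *unstructured.Unstructured) error {
	spec, found, err := unstructured.NestedMap(tpl.Object, "spec", "template", "spec")
	if err != nil {
		return err
	}
	if !found {
		return fmt.Errorf("template %s has no spec.template.spec", tpl.GetName())
	}
	// Exclude fields the provider can reconcile in-place.
	delete(spec, "additionalTags")

	raw, err := json.Marshal(spec) // map keys are marshaled in sorted order, so this is deterministic
	if err != nil {
		return err
	}
	hash := fmt.Sprintf("%x", sha256.Sum256(raw))

	annotations := tpl.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[templateHashAnnotation] = hash
	tpl.SetAnnotations(annotations)
	return nil
}
```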

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature
/area machineset
/area machinedeployment

muraee commented 3 months ago

cc @enxebre

sbueringer commented 3 months ago

cc @g-gaston

g-gaston commented 3 months ago

This could tangentially be related to the work being done by the in-place upgrades working group, although I'm not sure that's what we are looking for here. The in-place upgrades we have been working on are for things that need to change in the actual infra/host. This seems to be more about metadata.

Without having thought much about it, and not being an expert in the AWS provider: could this be implemented in the AWSMachineTemplate controller? When changes are detected in this field, propagate them to all child AWSMachines. Updating the template object should not trigger a Machine rollout from the MD controller.
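
For illustration only, a very rough sketch of what such propagation could look like (no such logic exists in CAPA today; the API version and function name are assumptions):

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// propagateTagsToChildren finds the AWSMachines cloned from this template (via
// the cluster.x-k8s.io/cloned-from-name annotation that CAPI sets on cloned
// infra objects) and copies the template's additionalTags onto them, so the
// existing AWSMachine reconciliation applies the tags to EC2 in-place.
func propagateTagsToChildren(ctx context.Context, c client.Client, tpl *unstructured.Unstructured) error {
	tags, found, err := unstructured.NestedMap(tpl.Object, "spec", "template", "spec", "additionalTags")
	if err != nil || !found {
		return err
	}

	machines := &unstructured.UnstructuredList{}
	machines.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "infrastructure.cluster.x-k8s.io",
		Version: "v1beta2", // assumed; adjust to the CAPA version in use
		Kind:    "AWSMachineList",
	})
	if err := c.List(ctx, machines, client.InNamespace(tpl.GetNamespace())); err != nil {
		return err
	}

	for i := range machines.Items {
		m := &machines.Items[i]
		// Only touch AWSMachines that were cloned from this template.
		if m.GetAnnotations()[clusterv1.TemplateClonedFromNameAnnotation] != tpl.GetName() {
			continue
		}
		patchBase := m.DeepCopy()
		if err := unstructured.SetNestedMap(m.Object, tags, "spec", "additionalTags"); err != nil {
			return err
		}
		if err := c.Patch(ctx, m, client.MergeFrom(patchBase)); err != nil {
			return err
		}
	}
	return nil
}
```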

enxebre commented 3 months ago

Yes, I see this as orthogonal to the in-place upgrade effort. I see this as an extension of the in-place field propagation effort to include support for cloud tags.

Interesting thinking on the AWSMachineTemplate. I think I'd rather keep it "dumb" to honour the current design, and instead explore Mulham's proposal linked above to enable an extension enhancement and let the higher-level controllers (MD/MS) orchestrate it. But that also has some caveats. Let's hear more thoughts :)

fabriziopandini commented 3 months ago

I'm personally -1 to the idea of asking users to add an annotation with a hash defining semantic equality; it seems too complex (we are increasing operational and conceptual complexity for Cluster API's users).

I'm also not sure this fits into the work we did for field propagation, because that was strictly scoped to Kubernetes object metadata and a small set of fields that impact only CAPI controller behaviour. Metadata applied to infrastructure components doesn't seem to fit into the category above (they are different metadata). Also, metadata / AWS's additionalTags seems to be just an example, while the ask is about a wider set of "mutable fields".

I think instead this request could fit well into the in-place upgrade work, because based on my understanding that work is meant to introduce an extension point that allows determining what can be changed in-place and what cannot, which gives every infrastructure provider the freedom to choose what makes sense for them.

Most importantly, this extension point could also give the ability to make decisions when multiple changes are applied at the same time (what if I change tags, disk size and region in a single operation?).

vincepri commented 3 months ago

/kind proposal
/priority important-soon
/triage accepted

alexander-demicev commented 3 months ago

We briefly discussed this topic at the in-place update feature group meeting yesterday. I believe that with some changes it might fit into the current design: https://hackmd.io/Wv_u2xXJQsaj4wWFQ3PqCQ?view. I like the idea proposed by @fabriziopandini: we can have multiple in-place update extensions. When users select the ExternalUpdate strategy, it will be the responsibility of the in-place update extension to either upgrade K8s components on the hosts or update the infrastructure-related fields. There are a couple of questions we will need to answer for this approach, for example: in what order will in-place update extensions be executed? Is ordering needed there at all, or should we just ensure that only one extension runs at a time?

k8s-ci-robot commented 3 months ago

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

fabriziopandini commented 3 months ago

Let's add triage accepted only when the issue is actionable. At the current stage we are still seeking consensus on a way forward.

fabriziopandini commented 3 months ago

> There are a couple of questions we will need to answer for this approach, for example: in what order will in-place update extensions be executed? Is ordering needed there at all, or should we just ensure that only one extension runs at a time?

@alexander-demicev @muraee please bring those points up into in-place feature group meetings

muraee commented 3 months ago

This is the correct doc for the in-place feature group meetings, right? https://docs.google.com/document/d/1GmRd6MyQ0mWAoJV6rCHhZTSTtKMKHdJzhXm0BLBXOnw

Unfortunately, I won't be able to make it to the meeting today, but I will bring it up next week for sure.

sbueringer commented 1 week ago

There's now a PR open with the proposal: https://github.com/kubernetes-sigs/cluster-api/pull/11029

sbueringer commented 1 week ago

@muraee Can you please check if the proposal addresses your requirements? (and if yes, deduplicate & close this issue)