kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0
3.57k stars 1.31k forks source link

MachinePools cannot be upgraded atomically when used with managed providers #7248

Open shyamradhakrishnan opened 2 years ago

shyamradhakrishnan commented 2 years ago

What steps did you take and what happened: Many providers have built their managed node pool based on CPI machinepools. In most cases, customers can provide a custom vm image to be used by the node pool. for example here - https://github.com/oracle/cluster-api-provider-oci/blob/63f38727a6fba8ef78668fcb8187922ddf3cbfa8/exp/api/v1beta1/ocimanagedmachinepool_types.go#L180

The nodepool version is take from the machinepool spec like follows MachinePool.Spec.Template.Spec.Version.

This results in an inability to upgrade the machinepool(node pools) atomically. Because if we have to upgrade the kubernetes version, the version in CAPI MachinePool spec as well as the custom image infrastructure machinepool spec has to be changed and since they are in different objects, they cannot be changed atomically.

What did you expect to happen:

Machinepools can be upgraded tomically.

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

/kind bug [One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

shyamradhakrishnan commented 2 years ago

/cc @richardcase

sbueringer commented 2 years ago

Hm, on KCP/MD/MS it would work like this: An additional InfrastructureMachineTemplate would be created and then the KCP/MD/MS would be updated to the new Kubernetes version and the new InfrastructureMachineTemplate at the same time.

I assume something similar doesn't work with the MachinePool as the InfraMachinePool is representing the "machines" directly so there is no rotation of the InfraMachinePool?

shyamradhakrishnan commented 2 years ago

@sbueringer yeah, the current implementation of machinepools does not have a rotation, the machinepool references a direct infra machine pool. for example https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/templates/cluster-template-aks.yaml#L58 and its the same for all providers I saw.

fabriziopandini commented 2 years ago

/cc @CecileRobertMichon @jackfrancis

CecileRobertMichon commented 2 years ago

This isn't something we've encountered with AzureManagedMachinePools as there is no image reference and the image is managed by AKS, so changing the k8s version on the MachinePool is enough to trigger a k8s version + image upgrade of the managed service. However, the same think might be true with non-managed clusters that specify a custom image when using MachinePools.

@mboersma wonder if you have any thoughts on this one, and whether #4063 would improve this situation or maybe make it easier to improve in the future?

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

shyamradhakrishnan commented 1 year ago

/remove-lifecycle stale

fabriziopandini commented 1 year ago

/triage accepted

Just for my understanding, is it possible to use the same "template rotation" approach used by other abstractions:

shyamradhakrishnan commented 1 year ago

@fabriziopandini the approach can be used but not currently as we dont use the concept of templates in machinepools, CAPI directly uses MachinePool infra objects. I think what you are suggesting is the right way forward and in sync with other places in CAPI.

shyamradhakrishnan commented 1 year ago

@fabriziopandini One more advantage is machinepool support in ClusterClass can be added with your suggestion.

k8s-triage-robot commented 9 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

fabriziopandini commented 6 months ago

/priority important-longterm /kind api-change

fabriziopandini commented 6 months ago

@mboersma @willie-yao @Jont828 I think that addressing this issue - or the concerns behind it - should be taken into consideration before graduating MP to GA