kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Enhanced kubernetes version upgrades for workload clusters #3203

Closed. JTarasovic closed this issue 3 years ago.

JTarasovic commented 4 years ago

User Story

As an operator, I would like to be able to easily update the Kubernetes version of my workload clusters to be able to stay on top of security patches and new features.

Detailed Description

The current procedure for updating the k8s version is to copy the MachineTemplate for KCP, then update KCP w/ the new version and a reference to the new MachineTemplate, which causes a rollout. Rinse and repeat for MachineDeployments.
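To make the manual flow concrete, here is a rough sketch of the control plane half, assuming the AWS provider and the v1alpha3 APIs; all names and values are placeholders:

```yaml
# 1. Copy the control plane's MachineTemplate under a new name
#    (infrastructure templates are effectively immutable, so a new object is needed).
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSMachineTemplate
metadata:
  name: my-cluster-control-plane-v1-19-1
spec:
  template:
    spec:
      instanceType: m5.large
---
# 2. Point the KubeadmControlPlane at the new template and bump the version;
#    this triggers a control plane rollout. The same copy/update/rollout pattern
#    is then repeated for each MachineDeployment.
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane
spec:
  version: v1.19.1
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AWSMachineTemplate
    name: my-cluster-control-plane-v1-19-1
```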

Ideally, I'd be able to declare my intent to upgrade the workload cluster and that would be reconciled and rolled out for me.

Anything else you would like to add:

Discussed at the 17 June 2020 weekly meeting.

/kind feature

fabriziopandini commented 4 years ago

This issue requires a certain degree of coordination across several components, so the first question in my mind is where to implement this logic. I don't think this should go at the Cluster level, because the Cluster's main responsibility is the cluster infrastructure, so what about assuming this should be implemented in a separate extension (with its own CRD/controller)?

vincepri commented 4 years ago

/milestone v0.4.0

We should revisit in the v1alpha4 timeframe; this probably needs a more detailed proposal.

CecileRobertMichon commented 4 years ago

cc @rbitia

Ria, this might fit into your "cluster group" proposal?

JTarasovic commented 4 years ago

We have a relatively small (but growing) number of clusters so we're currently doing upgrades sort of manually. Conceptually, we think about our clusters in 3 streams - alpha, beta and stable - and roll out upgrades and configuration changes according to stream.

Our plan right now is to have common configuration for a stream in a CR (StreamConfig) w/ a controller. The StreamConfig controller would reconcile to ClusterConfigs based on label / annotation, with its controller handling the actual cluster resource reconciliation (e.g. creation, k8s version upgrades, etc).[1]

I don't think that it's CAPI's responsibility to implement all of that (or any of it), but if we can do some of the common stuff (version upgrades) here, that seems like it would be super valuable for the whole community. It also seems like the logic would be broadly applicable - copy template, update KCP, rollout, copy template, update MDs, rollout, profit.[2]


[1] Names are illustrative and not definitive. Something, something hard problems in Computer Science.
[2] Grossly over-simplified here for effect.
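Purely for illustration, a hypothetical StreamConfig could look something like the sketch below; nothing here exists in CAPI today and every field name is made up:

```yaml
apiVersion: streams.example.io/v1alpha1   # hypothetical API group, not part of CAPI
kind: StreamConfig
metadata:
  name: beta
spec:
  kubernetesVersion: v1.19.1              # desired version for every cluster in this stream
  clusterSelector:
    matchLabels:
      stream: beta                        # clusters labeled "stream: beta" are reconciled to this config
```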

vincepri commented 4 years ago

Thanks for the extra context @JTarasovic. From everything I'm hearing here, it might be worth considering some extra utilities/libraries/commands under clusterctl that could perform some variations of the concepts described above.

seh commented 4 years ago

> Ideally, I'd be able to declare my intent to upgrade the workload cluster and that would be reconciled and rolled out for me.

I find that if I change the "spec.version" field in an existing KubeadmControlPlane object and apply the change, usually the controllers will upgrade my control plane, without me introducing a new (AWS)MachineTemplate. It sounds like that's not supposed to work, and yet it does—most of the time. Why is that?

JTarasovic commented 4 years ago

Does it actually change the version of the running cluster - eg kubectl get no -o wide shows the new version?

It did not in our experience. It would roll the control plane instances but they'd still be on the previous version.

CecileRobertMichon commented 4 years ago

This is how upgrading k8s version on control planes works currently: https://cluster-api.sigs.k8s.io/tasks/kubeadm-control-plane.html?highlight=rolling#how-to-upgrade-the-kubernetes-control-plane-version

Note that you might need to update the image as well if you are specifying the image to use in the machine template.
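For example, with the AWS provider the image reference lives on the machine template rather than on the KCP or MachineDeployment, so bumping spec.version alone won't change it; a rough sketch (template name and AMI ID are placeholders):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSMachineTemplate
metadata:
  name: my-cluster-md-0-v1-19-1     # new copy of the template, created for the upgrade
spec:
  template:
    spec:
      instanceType: m5.large
      ami:
        id: ami-0123456789abcdef0   # image built for the target Kubernetes version
```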

seh commented 4 years ago

> Does it actually change the version of the running cluster - eg kubectl get no -o wide shows the new version?

Yes, it shows the new version there.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

vincepri commented 3 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

vincepri commented 3 years ago

Any updates or action items here?

JTarasovic commented 3 years ago

I think the clusterctl rollout issue linked above is a good first approximation but I agree w/ @detiber's comment there:

> propose support in upstream Kubernetes/kubectl/kubebuilder for a sub-resource type

as that should allow folks to build controllers on top of it.

I'm cool with closing this issue in favor of that.

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle rotten

CecileRobertMichon commented 3 years ago

I think the clusterctl rollout feature doesn't solve the problem of having to update the image + k8s version for every machine deployment / machine pool / kubeadm control plane that you want to upgrade as a user, although it does give more control over the rollout of machines. It would still be nice to have some sort of higher-order "upgrade my cluster" automation.

@craiglpeters @devigned and I were discussing this earlier today, and one thing that came up was maybe having a way to tell your management cluster which image to use for which k8s version, and having the machine template look that up instead of having to individually update the image version on each cluster. This would also allow patching images across all your clusters if you have to rebuild an image for the same k8s version (e.g. because of a CVE).
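One very rough sketch of that idea, purely hypothetical (no such lookup exists in CAPI today): a management-cluster-level mapping from Kubernetes version to image that machine templates could resolve at reconcile time, e.g. as a ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: machine-image-catalog   # hypothetical name and convention
  namespace: capi-system
data:
  # one entry per supported Kubernetes version; a rebuilt image (e.g. for a CVE)
  # is swapped here once instead of editing every cluster's machine template
  v1.18.8: ami-0aaaaaaaaaaaaaaaa
  v1.19.1: ami-0bbbbbbbbbbbbbbbb
```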

CecileRobertMichon commented 3 years ago

/remove-lifecycle rotten

fiunchinho commented 3 years ago

> We have a relatively small (but growing) number of clusters so we're currently doing upgrades sort of manually. Conceptually, we think about our clusters in 3 streams - alpha, beta and stable - and roll out upgrades and configuration changes according to stream.
>
> Our plan right now is to have common configuration for a stream in a CR (StreamConfig) w/ a controller. The StreamConfig controller would reconcile to ClusterConfigs based on label / annotation with its controller handling the actual cluster resource reconciliation (eg creation, k8s version upgrades, etc).
>
> I don't think that it's CAPIs responsibility to implement all of that (or any) but if we can do some of the common stuff (version upgrades) here, that seems like it would be super valuable for the whole community. It also seems like the logic would be broadly applicable - copy template, update KCP, rollout, copy template, update MDs, rollout, profit.

We are in a really similar situation with a large number of clusters and three different pipelines/streams for development/staging/production clusters. We are starting the development of a new component to handle this in a similar fashion (copy template, update KCP, update MachinePool, etc), so it'd be great if we could share tooling. We were also interested in making this component capable of orchestrating this upgrade process so we could, for instance, decide to upgrade node pools one after the other, with some wait period in between, instead of all at once.

If I understand it correctly, this proposal adds `kubectl rollout`-like subcommands to clusterctl, but this wouldn't solve the use cases discussed above.

Should we submit a new CAEP proposal for discussion?

enxebre commented 3 years ago

Same use case here: looping over scalable machine resources, e.g. MachineDeployments, to upgrade them one by one against the current control plane version.

For scenarios where more control is required, it would possibly be good to have an autoUpgrade: false/true control per scalable machine resource, so you can leverage a more controlled upgrade for a given machine pool, e.g. https://github.com/kubernetes-sigs/cluster-api/pull/4346
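A sketch of what that knob could look like, with the field name and placement being hypothetical rather than an existing API:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineDeployment
metadata:
  name: my-cluster-md-0
spec:
  clusterName: my-cluster
  autoUpgrade: false          # hypothetical field: opt this pool out of orchestrated upgrades
  template:
    spec:
      clusterName: my-cluster
      version: v1.20.4        # only bumped when the operator explicitly decides to upgrade this pool
```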

smcaine commented 3 years ago

We have a similar use case. We are using GitOps + CAPI to upgrade our clusters. For now we have to create a new MachineTemplate, update KCP, wait for that to finish, delete the old template, create a new MachineTemplate for the MachineDeployment, wait for the rollout, and delete the old MachineTemplate. An operator or additional feature/resource that could handle this lifecycle as a whole (declaratively) would be ideal for us, so we can update the KCP and MachineDeployment MachineTemplate references at the same time and let the cluster reconcile and upgrade the control plane and workers in the correct order, then purge unwanted MachineTemplates.
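For reference, the worker-side half of that flow is the same pattern applied to the MachineDeployment; a sketch with v1alpha4 field names and placeholder values, committed alongside the KCP change so the controllers can order the rollout:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineDeployment
metadata:
  name: my-cluster-md-0
spec:
  clusterName: my-cluster
  template:
    spec:
      clusterName: my-cluster
      version: v1.20.4                       # bumped in the same commit as the KCP version
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
        kind: AWSMachineTemplate
        name: my-cluster-md-0-v1-20-4        # new template; the old one is purged after rollout
```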

enxebre commented 3 years ago

This relates to the ClusterClass discussion https://github.com/kubernetes-sigs/cluster-api/issues/4430. This will require a considerable amount of work and thinking to get right. @vincepri is this work still intended to make it into v1alpha4, or can we move it to the next milestone?

/area upgrades

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

fabriziopandini commented 3 years ago

What about closing this given the ClusterClass work?

sbueringer commented 3 years ago

Agree. This will be 100% covered by what we want to do with ClusterClass.

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

fabriziopandini commented 3 years ago

/close

As per the comment above this is part of ClusterClass; ongoing work in https://github.com/kubernetes-sigs/cluster-api/pull/5059

k8s-ci-robot commented 3 years ago

@fabriziopandini: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/3203#issuecomment-903560795):

> /close
> As per comment above this is part of ClusterClass; ongoing work in https://github.com/kubernetes-sigs/cluster-api/pull/5059

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.