Some ideas for 2.:
Assumption: the first step in updating a cluster to v1.22 is changing the version in the KubeadmControlPlane and then later on in the MachineDeployments etc. I think there's nothing we can do about other control plane providers.
Some ideas:
Hand over a client to the KubeadmControlPlane webhook. In the update validation we would block the update if the KubeadmControlPlane refers to the current cluster, i.e. it's self-hosted. Options to detect that: compare .status.nodeRef.name and .status.nodeRef.uid against the Nodes in the current cluster (?)
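Roughly, the validation could look like the sketch below. This is only an illustration under the assumptions discussed here: the `isSelfHosted` callback is a hypothetical stand-in for whatever mechanism ends up comparing nodeRefs with local Nodes, and the function is not wired into the actual v0.3.x webhook interface (which has no client, hence the controller-runtime concern further down).

```go
package webhooks

import (
	"context"
	"errors"

	"github.com/blang/semver"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1alpha3"
)

// validateKCPUpdate sketches the check the webhook would run. isSelfHosted is
// a hypothetical hook standing in for the nodeRef comparison described above.
func validateKCPUpdate(
	ctx context.Context,
	oldKCP, newKCP *controlplanev1.KubeadmControlPlane,
	isSelfHosted func(context.Context, *controlplanev1.KubeadmControlPlane) (bool, error),
) error {
	oldVersion, err := semver.ParseTolerant(oldKCP.Spec.Version)
	if err != nil {
		return err
	}
	newVersion, err := semver.ParseTolerant(newKCP.Spec.Version)
	if err != nil {
		return err
	}

	// Only the transition from < v1.22 to >= v1.22 is relevant here.
	if !(oldVersion.Minor < 22 && newVersion.Minor >= 22) {
		return nil
	}

	selfHosted, err := isSelfHosted(ctx, newKCP)
	if err != nil {
		return err
	}
	if selfHosted {
		return errors.New("cannot upgrade a self-hosted management cluster to v1.22 with Cluster API v0.3.x")
	}
	return nil
}
```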
Open questions:
/cc @fabriziopandini @randomvariable @vincepri @CecileRobertMichon
I'm not sure that blocking in webhooks is a viable option for v0.3.x, because this most probably requires controller-runtime changes, and I don't think we can get them into the version currently used in this branch.
That leaves us with blocking in the controllers.
/milestone v0.3
Regarding: "What do we want to do with management clusters which are already on v1.22?"
I tried to deploy CAPI v0.3.21 on Kubernetes v1.22.0-rc.0 and imho it's impossible to get it to work without any major changes to CAPI. So I think we can assume we don't have any existing v1.22 CAPI v0.3.x management cluster out there.
error registering secret controller: no matches for kind "MutatingWebhookConfiguration" in version "admissionregistration.k8s.io/v1beta1"
Follow-up: We have to upgrade cert-manager on main: #4983
k create secret generic -n capi-system capi-kubeadm-bootstrap-webhook-service-cert --from-file=tls.key --from-file tls.crt
reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.9/tools/cache/reflector.go:105: Failed to list *v1alpha3.Machine: Internal error occurred: error resolving resource
Note from the CAPI meeting: we should also update the following pages in the book:
/assign
I agree that a management cluster already on v1.22 will probably need manual remediation.
@fabriziopandini @vincepri @randomvariable I would then start implementing the part in the KubeadmControlPlane controller. As we assume we don't have healthy v1.22 management clusters out there, I would implement the following:
For detecting if the cluster is self-hosted, I can think of the following options: check if the Machines' .status.nodeRef.{name,uid} also exist in the current cluster via the managementCluster client (see the sketch below). Maybe I'm missing the one obvious and good solution :)
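A minimal sketch of that nodeRef check, assuming the controller's own client points at the management cluster and the control plane Machines are already listed; the helper name `isSelfHosted` is made up for illustration:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// isSelfHosted is a hypothetical helper: given a client to the management
// cluster and the control plane Machines, it reports whether any Machine's
// nodeRef matches a Node in this cluster, i.e. whether the KubeadmControlPlane
// manages the cluster the controller itself runs in.
func isSelfHosted(ctx context.Context, mgmtClient client.Client, machines []*clusterv1.Machine) (bool, error) {
	for _, m := range machines {
		if m.Status.NodeRef == nil {
			continue
		}
		node := &corev1.Node{}
		if err := mgmtClient.Get(ctx, client.ObjectKey{Name: m.Status.NodeRef.Name}, node); err != nil {
			if apierrors.IsNotFound(err) {
				continue
			}
			return false, err
		}
		// Compare name and UID so an unrelated Node with the same name
		// does not produce a false positive.
		if node.UID == m.Status.NodeRef.UID {
			return true, nil
		}
	}
	return false, nil
}
```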
Discarded options:
Could we check if the Cluster API CRDs are installed and block the upgrade to v1.22?
> Could we check if the Cluster API CRDs are installed and block the upgrade to v1.22?
We could; this would additionally block the upgrade when we update other management clusters (i.e. not only ourselves, in cases like mgmt cluster => mgmt cluster => workload cluster). But I think this is also a case which would be nice to cover (and it wouldn't be covered by my solutions).
So yup, this seems to be the best solution yet.
We can use the partial object metadata client to find the Cluster CRD and block there. We'd need a remote client to the workload cluster, try to retrieve the CRD with PartialObjectMetadata (see the convert references function as an example) and if the call is successful, assume it's a management cluster.
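Something along these lines, a sketch assuming a controller-runtime client to the workload cluster that supports PartialObjectMetadata; the helper name is hypothetical:

```go
package controllers

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// workloadClusterIsManagementCluster is a hypothetical helper: it fetches only
// the metadata of the Cluster CRD from the workload cluster. If the CRD exists,
// we assume that cluster is itself a management cluster and block the upgrade.
func workloadClusterIsManagementCluster(ctx context.Context, remoteClient client.Client) (bool, error) {
	crd := &metav1.PartialObjectMetadata{}
	crd.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "apiextensions.k8s.io",
		Version: "v1",
		Kind:    "CustomResourceDefinition",
	})
	if err := remoteClient.Get(ctx, client.ObjectKey{Name: "clusters.cluster.x-k8s.io"}, crd); err != nil {
		if apierrors.IsNotFound(err) {
			return false, nil
		}
		return false, err
	}
	return true, nil
}
```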
/close
@vincepri: Closing this issue.
User Story
As a user I would like to get an error as early as possible when trying to upgrade a management cluster to v1.22 (using CAPI v0.3.x)
Detailed Description
There are different ways in which a Kubernetes v1.22 management cluster could be created or updated:
Anything else you would like to add:
Open questions
/kind feature