siderolabs / cluster-api-control-plane-provider-talos

A control plane provider for CAPI + Talos
Mozilla Public License 2.0

control plane rolling upgrades on spec changes #107

Closed smira closed 1 year ago

smira commented 2 years ago

The control plane provider should support rolling out a new set of control plane machines on spec changes (similar to the MachineDeployment controller and the kubeadm control plane provider).

E.g. see this code from kubeadm provider:

https://github.com/kubernetes-sigs/cluster-api/blob/cefc044b286676bf5f04a9b3e9009eb93c2a5329/controlplane/kubeadm/internal/controllers/upgrade.go#L33-L34

smira commented 2 years ago

Rollout strategy: https://github.com/kubernetes-sigs/cluster-api/blob/cefc044b286676bf5f04a9b3e9009eb93c2a5329/controlplane/kubeadm/api/v1beta1/kubeadm_control_plane_types.go#L106-L131

ErikLundJensen commented 1 year ago

This looks like an old issue. Has this been implemented in "feat: support TalosControlPlane rolling upgrade"?

I have a cluster with 3 control plane nodes. When I update the TalosControlPlane to reference a new infrastructureTemplate (for example, one with a new image template name), a new control plane node is created as expected, but the node does not join the existing cluster.

The result is 4 running control plane nodes, and the rolling update does not seem to proceed.

This is my configuration, using Talos with CAPV (VMware) as the infrastructure provider: cluster yaml. To render standard.yaml into cluster.yaml, run the commands:

. ./standard.env
envsubst < ../standard.yaml >cluster.yaml

smira commented 1 year ago

Yes, this was implemented a long time ago. We have a test for the rollout of new control plane nodes.

The rollout might stop if the control plane is not healthy; that is the expected behavior.

If the new node doesn't join, that should be investigated first. The control plane resource status shows detailed information about failed checks, while the Talos logs might show why the node doesn't join.

Preisschild commented 1 year ago

One issue that still exists is that a rollout is not triggered if you edit the Talos config (.spec.controlPlaneConfig) in your TalosControlPlane resource.

It is only triggered if you edit the infrastructure reference (.spec.infrastructureTemplate) or the Kubernetes version (.spec.version).
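
For orientation, here is a rough sketch of where these three fields sit in a TalosControlPlane manifest. The resource name, template name, version, and the VSphereMachineTemplate kind are illustrative placeholders rather than values taken from this thread, and the comments only restate the behavior reported above.

```yaml
# Illustrative TalosControlPlane; all names and versions are placeholders.
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: example-cp                   # hypothetical name
spec:
  replicas: 3
  version: v1.27.3                   # .spec.version: reported to trigger a rollout when changed
  infrastructureTemplate:            # reported to trigger a rollout when changed
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereMachineTemplate
    name: example-cp-template-v2     # hypothetical template name
  controlPlaneConfig:                # reported NOT to trigger a rollout when edited
    controlplane:
      generateType: controlplane
```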

smira commented 1 year ago

Hm... That deserves a separate issue 😉

ErikLundJensen commented 1 year ago

The issue was caused by the VMware cloud provider (CPI). It requires the taint node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule to be set on new nodes. If the taint is not set, the ProviderID on the node is not set, and as a result the nodeRef in the Machine is not set either.

When I manually apply the taint to the new nodes, the rolling update of the control plane works. However, I do not see any configuration option for adding custom taints to new nodes in Talos (at least not through the Cluster API).

smira commented 1 year ago

The kubelet supports taints via its config: https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/

Talos provides a way to add extra config for the kubelet.
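
As a minimal sketch of how the two pieces could be combined, the fragment below assumes the kubelet configuration field `registerWithTaints` and the Talos machine config key `machine.kubelet.extraConfig`, applied through a JSON patch in the TalosControlPlane's `configPatches`. None of this was verified in the thread, so treat the exact paths as assumptions to check against your Talos and provider versions.

```yaml
# Fragment of a TalosControlPlane spec (sketch, not verified in this thread).
spec:
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      configPatches:
        # Note: "add" on this path replaces any existing extraConfig object.
        - op: add
          path: /machine/kubelet/extraConfig
          value:
            registerWithTaints:
              # Taint the VMware CPI expects on uninitialized nodes
              - key: node.cloudprovider.kubernetes.io/uninitialized
                value: "true"
                effect: NoSchedule
```

As noted earlier in the thread, editing .spec.controlPlaneConfig alone was reported not to trigger a rollout, so a change like this may only apply to machines created afterwards.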

I'm going to close this issue, as the taint problem is unrelated and the rollout itself is already implemented in CACPPT. Let's move this to a new issue, Slack, or a discussion if it needs further investigation. Thank you.