rancher-sandbox / cluster-api-provider-rke2

RKE2 bootstrap and control-plane Cluster API providers.
Apache License 2.0

no control plane node rolling update is triggered on a change of preRKE2Commands #307

Closed tmmorin closed 1 month ago

tmmorin commented 2 months ago

I observed the following after adding a simple test command (echo 42 > /tmp/test) to preRKE2Commands in both my RKE2ControlPlane resource and the RKE2ConfigTemplate resource used for a MachineDeployment.

$ k get rke2controlplane management-cluster-control-plane -o yaml | yq '.spec.preRKE2Commands[-1]'
echo 42 > /tmp/test
$ k get rke2configtemplate management-cluster-md0-4a6c705d7e -o yaml | yq '.spec.template.spec.preRKE2Commands[-1]'
echo 42 > /tmp/test

As expected, for the MachineDeployment a node rolling update was triggered.

All the RKE2Config resources for my MD have this test command:

$ k get rke2config -o yaml -l cluster.x-k8s.io/deployment-name=management-cluster-md0 | yq '.items[] | {"name":.metadata.name,"test":(.spec.preRKE2Commands[-1] | test("42"))}'
name: management-cluster-md0-4a6c705d7e-7dfxr
test: true
name: management-cluster-md0-4a6c705d7e-jdgcg
test: true
name: management-cluster-md0-4a6c705d7e-qk5xd
test: true

However, for the control plane, no rolling update was triggered.

The RKE2Config resources for the control plane don't have the test command:

$ k get rke2config -o yaml -l cluster.x-k8s.io/control-plane | yq '.items[] | {"name":.metadata.name,"test":(.spec.preRKE2Commands[-1] | test("42"))}'
name: management-cluster-control-plane-2qsws
test: false
name: management-cluster-control-plane-r5czr
test: false
name: management-cluster-control-plane-v48rj
test: false
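Since the RKE2ControlPlane status does not report this drift, a small helper can list the control plane RKE2Config resources whose last preRKE2Commands entry differs from the expected one. This is only a sketch: the function name `stale_configs` is made up for illustration, and the pipeline in the trailing comment assumes yq v4 with the `@tsv` encoder.

```shell
# Read "name<TAB>last-command" lines on stdin and print the names whose
# last command differs from the expected command given as $1.
# Note: awk -v interprets backslash escapes, so this assumes the expected
# command contains no backslashes.
stale_configs() {
  awk -F '\t' -v want="$1" '$2 != want { print $1 }'
}

# Possible usage against a live cluster (assumption: yq v4 with @tsv):
#   kubectl get rke2config -l cluster.x-k8s.io/control-plane -o yaml \
#     | yq '.items[] | [.metadata.name, .spec.preRKE2Commands[-1]] | @tsv' \
#     | stale_configs 'echo 42 > /tmp/test'
```

An empty output means every listed RKE2Config ends with the expected command.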

The status of the RKE2ControlPlane is fully ready though, showing no sign of any rolling update being in progress:

$ k get rke2controlplane management-cluster-control-plane -o yaml | yq .status
availableServerIPs:
  - 172.20.129.32
conditions:
  - lastTransitionTime: "2024-04-24T14:47:18Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-04-15T10:16:05Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2024-04-15T10:16:05Z"
    status: "True"
    type: CertificatesAvailable
  - lastTransitionTime: "2024-04-24T14:51:15Z"
    status: "True"
    type: ControlPlaneComponentsHealthy
  - lastTransitionTime: "2024-04-24T14:47:18Z"
    status: "True"
    type: MachinesReady
  - lastTransitionTime: "2024-04-24T14:46:29Z"
    status: "True"
    type: MachinesSpecUpToDate
  - lastTransitionTime: "2024-04-24T14:47:18Z"
    status: "True"
    type: Resized
initialized: true
observedGeneration: 11
ready: true
readyReplicas: 3
replicas: 3
updatedReplicas: 3

The expected behavior is for a rolling update of the control plane nodes to be triggered.

Note that a rolling update is properly triggered on a change of, for instance, spec.agentConfig.kubelet.extraArgs.

(The title of this issue mentions "a change of preRKE2Commands" because I didn't try to be exhaustive in this bug report, but we observed the issue on other fields as well, so it is likely not specific to preRKE2Commands.)

tmmorin commented 2 months ago

hello @belgaied2 @richardcase @Danil-Grigorev -- fyi ^

We can work around this limitation by triggering the rolling update via an arbitrary change to some benign entry in spec.agentConfig.kubelet.extraArgs, but this really isn't great: for an unaware user, an intended change will still silently fail to be applied.
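For reference, the workaround could look roughly like the sketch below. It assumes that spec.agentConfig.kubelet.extraArgs already exists as a list of strings on the RKE2ControlPlane, and it uses a JSON Patch so that existing entries are kept. The function names are made up for illustration, and the argument to append is left as a placeholder, since which kubelet argument counts as benign depends on the cluster.

```shell
# Build a JSON Patch (RFC 6902) that appends one entry to the extraArgs list.
# Assumption: $1 contains no characters that need JSON escaping.
build_nudge_patch() {
  printf '[{"op":"add","path":"/spec/agentConfig/kubelet/extraArgs/-","value":"%s"}]\n' "$1"
}

# Apply the patch to the named RKE2ControlPlane in the current namespace.
# Not called automatically; $1 is the resource name, $2 the entry to append.
nudge_control_plane() {
  kubectl patch rke2controlplane "$1" --type=json -p "$(build_nudge_patch "$2")"
}

# Possible usage (the second argument is a placeholder, not a real flag):
#   nudge_control_plane management-cluster-control-plane '<benign-kubelet-arg>'
```

This only forces the rollout; it does not change the fact that the edit to preRKE2Commands alone goes unnoticed.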