rancher / cluster-api-provider-rke2

RKE2 bootstrap and control-plane Cluster API providers.
https://rancher.github.io/cluster-api-provider-rke2/
Apache License 2.0

Inaccurate RKE2ControlPlane status when maxSurge is set to 0 #356

Open zioc opened 4 months ago

zioc commented 4 months ago

What happened:

This issue was observed while searching for a workaround for this issue in the Sylva project: https://gitlab.com/sylva-projects/sylva-core/-/issues/1412

It is somehow related to https://github.com/rancher-sandbox/cluster-api-provider-rke2/issues/355

When the control plane is upgraded with the following strategy:

    rolloutStrategy:
      rollingUpdate:
        maxSurge: 0

the control plane's Ready condition is set to True while the last control plane machine is still being rolled out:

NAME                                                  CLUSTER                           NODENAME                                          PROVIDERID                                                                                                                      PHASE      AGE     VERSION
mgmt-1353958806-rke2-capm3-virt-control-plane-kw7q4   mgmt-1353958806-rke2-capm3-virt   mgmt-1353958806-rke2-capm3-virt-management-cp-2   metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-2/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-n9sgv       Deleting   81m     v1.28.8
mgmt-1353958806-rke2-capm3-virt-control-plane-shl4r   mgmt-1353958806-rke2-capm3-virt   mgmt-1353958806-rke2-capm3-virt-management-cp-0   metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-0/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-nqls5       Running    8m56s   v1.28.8
mgmt-1353958806-rke2-capm3-virt-control-plane-w4ldw   mgmt-1353958806-rke2-capm3-virt   mgmt-1353958806-rke2-capm3-virt-management-cp-1   metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-1/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-422g7       Running    31m     v1.28.8

We can indeed see that even though the mgmt-1353958806-rke2-capm3-virt-control-plane-kw7q4 machine is being deleted, it still has a Ready=True condition:

- apiVersion: cluster.x-k8s.io/v1beta1
  kind: Machine
  metadata:
    annotations:
      controlplane.cluster.x-k8s.io/rke2-server-configuration: '{"tlsSan":["172.18.0.2"],"disableComponents":{"pluginComponents":["rke2-ingress-nginx"]},"cni":"calico","etcd":{"backupConfig":{},"customConfig":{"extraArgs":["auto-compaction-mode=periodic","auto-compaction-retention=12h","quota-backend-bytes=4294967296"]}}}'
    creationTimestamp: "2024-06-29T21:20:17Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2024-06-29T22:40:07Z"
    finalizers:
    - machine.cluster.x-k8s.io
    generation: 2
    labels:
      cluster.x-k8s.io/cluster-name: mgmt-1353958806-rke2-capm3-virt
      cluster.x-k8s.io/control-plane: ""
      name: mgmt-1353958806-rke2-capm3-virt-control-plane-kw7q4
    namespace: sylva-system
    ownerReferences:
    - apiVersion: controlplane.cluster.x-k8s.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: RKE2ControlPlane
      name: mgmt-1353958806-rke2-capm3-virt-control-plane
      uid: 2603a3dc-5112-4203-a86e-39d1fe67f365
    resourceVersion: "313188"
    uid: 8eea3ee0-f40d-48fb-8244-6bef3b3491bd
  spec:
    bootstrap:
      configRef:
        apiVersion: bootstrap.cluster.x-k8s.io/v1alpha1
        kind: RKE2Config
        name: mgmt-1353958806-rke2-capm3-virt-control-plane-khrlc
        namespace: sylva-system
        uid: 9428a36d-d5b1-4cc2-a52f-60e4b9d47b0f
      dataSecretName: mgmt-1353958806-rke2-capm3-virt-control-plane-khrlc
    clusterName: mgmt-1353958806-rke2-capm3-virt
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: Metal3Machine
      name: mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-n9sgv
      namespace: sylva-system
      uid: 768f1aa0-502a-41bb-8904-ecdedbc1c553
    nodeDeletionTimeout: 10s
    nodeDrainTimeout: 5m0s
    providerID: metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-2/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-n9sgv
    version: v1.28.8
  status:
    addresses:
    - address: 192.168.100.12
      type: InternalIP
    - address: fe80::be48:57d4:3796:1ceb%ens5
      type: InternalIP
    - address: 192.168.10.12
      type: InternalIP
    - address: fe80::e84c:af6d:e97d:f1e9%ens4
      type: InternalIP
    - address: localhost.localdomain
      type: Hostname
    - address: localhost.localdomain
      type: InternalDNS
    bootstrapReady: true
    conditions:
    - lastTransitionTime: "2024-06-29T21:20:22Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-06-29T22:32:15Z"
      status: "True"
      type: AgentHealthy
    - lastTransitionTime: "2024-06-29T21:20:22Z"
      status: "True"
      type: BootstrapReady
    - lastTransitionTime: "2024-06-29T22:40:07Z"
      message: Draining the node before deletion
      reason: Draining
      severity: Info
      status: "False"
      type: DrainingSucceeded
    - lastTransitionTime: "2024-06-29T22:40:08Z"
      reason: Deleting
      severity: Info
      status: "False"
      type: EtcdMemberHealthy
    - lastTransitionTime: "2024-06-29T21:20:22Z"
      status: "True"
      type: InfrastructureReady
    - lastTransitionTime: "2024-06-29T22:32:16Z"
      status: "True"
      type: NodeHealthy
    - lastTransitionTime: "2024-06-29T21:20:23Z"
      status: "True"
      type: NodeMetadataUpToDate
    - lastTransitionTime: "2024-06-29T22:40:35Z"
      status: "True"
      type: PreDrainDeleteHookSucceeded

Consequently, the controller sets the RKE2ControlPlane as ready here, since len(readyMachines) == replicas.

Shouldn't it instead check that the count of machines that are both Ready and up to date matches spec.replicas?
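
A minimal sketch of the kind of check I mean (the helper name and the version-only "up to date" comparison are just illustrative assumptions, not the provider's actual code, which would also have to compare the rest of the machine spec):

    package controlplane

    import (
    	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
    	"sigs.k8s.io/cluster-api/util/conditions"
    )

    // countReadyUpToDate is a hypothetical helper: it counts only machines that
    // are Ready, not being deleted, and already at the desired version.
    func countReadyUpToDate(machines []*clusterv1.Machine, desiredVersion string) int {
    	count := 0
    	for _, m := range machines {
    		if !m.DeletionTimestamp.IsZero() {
    			// Machine is being rolled out (like the kw7q4 machine above):
    			// its Ready condition is still True, but it should not count.
    			continue
    		}
    		if m.Spec.Version == nil || *m.Spec.Version != desiredVersion {
    			// Machine has not been replaced with the desired spec yet.
    			continue
    		}
    		if !conditions.IsTrue(m, clusterv1.ReadyCondition) {
    			continue
    		}
    		count++
    	}
    	return count
    }

    // The control plane would then only be reported Ready when
    // countReadyUpToDate(machines, desiredVersion) == int(*rcp.Spec.Replicas),
    // instead of len(readyMachines) == replicas.

With maxSurge: 0 the machine being deleted is never compensated by a surge machine, so such a check would keep the RKE2ControlPlane not-Ready until the replacement machine joins.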

github-actions[bot] commented 14 hours ago

This issue is stale because it has been open 90 days with no activity.