zioc opened this issue 4 months ago
What happened:
This issue was observed while searching for a workaround for this issue in the Sylva project: https://gitlab.com/sylva-projects/sylva-core/-/issues/1412
It is also somewhat related to https://github.com/rancher-sandbox/cluster-api-provider-rke2/issues/355
When the control plane is upgraded with the following strategy:
```yaml
rolloutStrategy:
  rollingUpdate:
    maxSurge: 0
```
the control plane's Ready condition is set to True while the last control plane machine is still being rolled out:
```
NAME                                                  CLUSTER                           NODENAME                                          PROVIDERID                                                                                                                  PHASE      AGE     VERSION
mgmt-1353958806-rke2-capm3-virt-control-plane-kw7q4   mgmt-1353958806-rke2-capm3-virt   mgmt-1353958806-rke2-capm3-virt-management-cp-2   metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-2/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-n9sgv   Deleting   81m     v1.28.8
mgmt-1353958806-rke2-capm3-virt-control-plane-shl4r   mgmt-1353958806-rke2-capm3-virt   mgmt-1353958806-rke2-capm3-virt-management-cp-0   metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-0/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-nqls5   Running    8m56s   v1.28.8
mgmt-1353958806-rke2-capm3-virt-control-plane-w4ldw   mgmt-1353958806-rke2-capm3-virt   mgmt-1353958806-rke2-capm3-virt-management-cp-1   metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-1/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-422g7   Running    31m     v1.28.8
```
We can indeed see that even though the `mgmt-1353958806-rke2-capm3-virt-control-plane-kw7q4` machine is being deleted, it still has a `Ready=True` condition:
```yaml
- apiVersion: cluster.x-k8s.io/v1beta1
  kind: Machine
  metadata:
    annotations:
      controlplane.cluster.x-k8s.io/rke2-server-configuration: '{"tlsSan":["172.18.0.2"],"disableComponents":{"pluginComponents":["rke2-ingress-nginx"]},"cni":"calico","etcd":{"backupConfig":{},"customConfig":{"extraArgs":["auto-compaction-mode=periodic","auto-compaction-retention=12h","quota-backend-bytes=4294967296"]}}}'
    creationTimestamp: "2024-06-29T21:20:17Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2024-06-29T22:40:07Z"
    finalizers:
    - machine.cluster.x-k8s.io
    generation: 2
    labels:
      cluster.x-k8s.io/cluster-name: mgmt-1353958806-rke2-capm3-virt
      cluster.x-k8s.io/control-plane: ""
    name: mgmt-1353958806-rke2-capm3-virt-control-plane-kw7q4
    namespace: sylva-system
    ownerReferences:
    - apiVersion: controlplane.cluster.x-k8s.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: RKE2ControlPlane
      name: mgmt-1353958806-rke2-capm3-virt-control-plane
      uid: 2603a3dc-5112-4203-a86e-39d1fe67f365
    resourceVersion: "313188"
    uid: 8eea3ee0-f40d-48fb-8244-6bef3b3491bd
  spec:
    bootstrap:
      configRef:
        apiVersion: bootstrap.cluster.x-k8s.io/v1alpha1
        kind: RKE2Config
        name: mgmt-1353958806-rke2-capm3-virt-control-plane-khrlc
        namespace: sylva-system
        uid: 9428a36d-d5b1-4cc2-a52f-60e4b9d47b0f
      dataSecretName: mgmt-1353958806-rke2-capm3-virt-control-plane-khrlc
    clusterName: mgmt-1353958806-rke2-capm3-virt
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: Metal3Machine
      name: mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-n9sgv
      namespace: sylva-system
      uid: 768f1aa0-502a-41bb-8904-ecdedbc1c553
    nodeDeletionTimeout: 10s
    nodeDrainTimeout: 5m0s
    providerID: metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-2/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-n9sgv
    version: v1.28.8
  status:
    addresses:
    - address: 192.168.100.12
      type: InternalIP
    - address: fe80::be48:57d4:3796:1ceb%ens5
      type: InternalIP
    - address: 192.168.10.12
      type: InternalIP
    - address: fe80::e84c:af6d:e97d:f1e9%ens4
      type: InternalIP
    - address: localhost.localdomain
      type: Hostname
    - address: localhost.localdomain
      type: InternalDNS
    bootstrapReady: true
    conditions:
    - lastTransitionTime: "2024-06-29T21:20:22Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-06-29T22:32:15Z"
      status: "True"
      type: AgentHealthy
    - lastTransitionTime: "2024-06-29T21:20:22Z"
      status: "True"
      type: BootstrapReady
    - lastTransitionTime: "2024-06-29T22:40:07Z"
      message: Draining the node before deletion
      reason: Draining
      severity: Info
      status: "False"
      type: DrainingSucceeded
    - lastTransitionTime: "2024-06-29T22:40:08Z"
      reason: Deleting
      severity: Info
      status: "False"
      type: EtcdMemberHealthy
    - lastTransitionTime: "2024-06-29T21:20:22Z"
      status: "True"
      type: InfrastructureReady
    - lastTransitionTime: "2024-06-29T22:32:16Z"
      status: "True"
      type: NodeHealthy
    - lastTransitionTime: "2024-06-29T21:20:23Z"
      status: "True"
      type: NodeMetadataUpToDate
    - lastTransitionTime: "2024-06-29T22:40:35Z"
      status: "True"
      type: PreDrainDeleteHookSucceeded
```
Consequently, the controller sets the RKE2ControlPlane as ready here, since `len(readyMachines) == replicas`.
Shouldn't it instead check that the count of machines that are Ready and UpToDate matches `spec.replicas`?
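A minimal sketch of what such a check could look like, using the upstream cluster-api condition helpers. The function name and the `isUpToDate` callback are hypothetical (standing in for whatever mechanism the provider uses to compare a machine against the current RKE2ControlPlane spec); this only illustrates the idea, it is not the provider's actual code:

```go
package controlplane

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// readyAndUpToDateMachines counts machines that are Ready, up to date with the
// desired control plane spec, and not already being deleted. Comparing this
// count against spec.replicas (instead of the plain Ready count) would keep
// the control plane from being reported Ready while the last outdated machine
// is still being rolled out.
func readyAndUpToDateMachines(machines []*clusterv1.Machine, isUpToDate func(*clusterv1.Machine) bool) int {
	count := 0
	for _, m := range machines {
		// A deleting machine can still carry Ready=True (see the dump above),
		// so exclude anything with a deletionTimestamp explicitly.
		if !m.DeletionTimestamp.IsZero() {
			continue
		}
		if conditions.IsTrue(m, clusterv1.ReadyCondition) && isUpToDate(m) {
			count++
		}
	}
	return count
}
```

With something along these lines, the RKE2ControlPlane Ready condition would only be set once this count equals the desired replica count.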
This issue is stale because it has been open 90 days with no activity.