rancher-sandbox / cluster-api-provider-rke2

RKE2 bootstrap and control-plane Cluster API providers.
Apache License 2.0

Controller is not scaling-up degraded control plane #352

Open zioc opened 1 week ago

What happened:

On a degraded cluster, 2 of the 3 control-plane nodes were unhealthy. One CP machine was deleted, but the control plane controller did not re-create it. We observed the following log:

"Scaling up control plane" "Desired"=3 "Existing"=2 [...]

Right after that, however, the control plane was not scaled up because the following preflight check failed:

"Waiting for control plane to pass preflight checks" [...] "failures"="machine management-cluster-control-plane-dklp7 reports AgentHealthy condition is false (Error, Missing node)"

Is this on purpose? Wouldn't it be legitimate to scale up the control plane anyway in such cases? Even if some node is unhealthy, wouldn't it be worth creating a new machine to match the requested number of replicas?
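To make the question concrete, here is a minimal Go sketch comparing the current decision with the relaxed one being asked about. Everything in it is hypothetical: `ControlPlaneState`, `Decision`, `decide`, and the `tolerateUnhealthy` parameter are names invented for this example, not options that exist in the provider.

```go
package main

import "fmt"

// ControlPlaneState is a hypothetical summary of what the controller sees.
type ControlPlaneState struct {
	Desired   int // requested replicas
	Existing  int // machines that currently exist
	Unhealthy int // existing machines failing preflight checks
}

// Decision is the hypothetical outcome of one reconcile pass.
type Decision string

const (
	NoScaleUp        Decision = "no-scale-up"
	WaitForPreflight Decision = "wait-for-preflight"
	ScaleUp          Decision = "scale-up"
)

// decide returns the scale-up decision. With tolerateUnhealthy=false it
// models the behavior seen in the logs; with tolerateUnhealthy=true it
// models the relaxed policy suggested in this issue (an assumption, not
// an existing feature).
func decide(s ControlPlaneState, tolerateUnhealthy bool) Decision {
	if s.Existing >= s.Desired {
		return NoScaleUp
	}
	if s.Unhealthy > 0 && !tolerateUnhealthy {
		return WaitForPreflight
	}
	return ScaleUp
}

func main() {
	// Values taken from the log: "Desired"=3 "Existing"=2, one machine unhealthy.
	s := ControlPlaneState{Desired: 3, Existing: 2, Unhealthy: 1}
	fmt.Println("strict: ", decide(s, false))
	fmt.Println("relaxed:", decide(s, true))
}
```

The sketch deliberately ignores any other safety consideration the controller may have for blocking scale-up on an unhealthy control plane. It only shows the difference between the two policies.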

Here is a more complete log. It shows that a new machine was generated as soon as the second unhealthy CP node was deleted (see the end of the log):

[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:45:24.724005       1 rke2controlplane_controller.go:387]  "msg"="Reconcile RKE2 Control Plane" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="fb967ad8-a26d-4431-bef7-5ae459a10cde"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:45:24.775706       1 rke2controlplane_controller.go:510]  "msg"="Scaling up control plane" "Desired"=3 "Existing"=2 "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="fb967ad8-a26d-4431-bef7-5ae459a10cde"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:45:24.775772       1 scale.go:225]  "msg"="Waiting for control plane to pass preflight checks" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "failures"="machine management-cluster-control-plane-dklp7 reports AgentHealthy condition is false (Error, Missing node)" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="fb967ad8-a26d-4431-bef7-5ae459a10cde"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:01.000267       1 rke2controlplane_controller.go:387]  "msg"="Reconcile RKE2 Control Plane" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="c3a6ad0b-bfd1-4d74-99ad-62e6c7b94e5b"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:01.044557       1 rke2controlplane_controller.go:510]  "msg"="Scaling up control plane" "Desired"=3 "Existing"=2 "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="c3a6ad0b-bfd1-4d74-99ad-62e6c7b94e5b"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:01.045034       1 scale.go:225]  "msg"="Waiting for control plane to pass preflight checks" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "failures"="machine management-cluster-control-plane-dklp7 reports AgentHealthy condition is false (Error, Missing node)" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="c3a6ad0b-bfd1-4d74-99ad-62e6c7b94e5b"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:39.960985       1 rke2controlplane_controller.go:387]  "msg"="Reconcile RKE2 Control Plane" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="b6992d19-b1aa-4ea3-91e4-382a84f512d0"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:40.009671       1 rke2controlplane_controller.go:510]  "msg"="Scaling up control plane" "Desired"=3 "Existing"=2 "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="b6992d19-b1aa-4ea3-91e4-382a84f512d0"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:40.084152       1 scale.go:402]  "msg"="Version checking..." "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "machine-version: "="1.28.8" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="b6992d19-b1aa-4ea3-91e4-382a84f512d0" "rke2-version"="v1.28.8+rke2r1"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:40.084188       1 scale.go:425]  "msg"="generating machine:" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "machine-spec-version"="1.28.8" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="b6992d19-b1aa-4ea3-91e4-382a84f512d0"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:40.156976       1 rke2controlplane_controller.go:387]  "msg"="Reconcile RKE2 Control Plane" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="2040b8a7-d273-458c-805a-cf45dc0b2e57"

Environment:

sylva 1.1.0 rke2 + capo