Instance in machine pool failed to join cluster with error "bootstrap token not found"

archerwu9425 commented 1 month ago

What steps did you take and what happened?

After I used clusterctl move to migrate an existing workload cluster to a new management cluster, instances in the AWS machine pool failed to join the cluster, and the machine pool got stuck in a loop of creating and terminating EC2 instances.

Error log found in the kubeadm bootstrap controller:

E0802 08:52:55.552210       1 controller.go:329] "Reconciler error" err="failed to get bootstrap token secret in order to refresh it: secrets \"bootstrap-token-a0o08u\" not found" controller="kubeadmconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="KubeadmConfig" KubeadmConfig="kubed-08/kubed-08-worker-29521" namespace="kubed-08" name="kubed-08-worker-29521" reconcileID="e7d2eb93-428f-4d9c-b64c-95f639a586ff"

The root cause appears to be:

During clusterctl move, the paused field is set on the cluster and reconciliation stops.
Due to a provider version issue, the move failed the first time and took longer than usual.
The bootstrap controller uses the default bootstrap token TTL of 15 minutes, so the token expired and was deleted in the workload cluster during the paused period.
The machine pool size range is 0-10, and we run cluster autoscaler in the workload cluster. It scaled the machine pool up from 0 to 1 during the paused period, which left the MachinePool with replicas set to 1 but no nodeRef in its status. Refer to this code block: https://github.com/kubernetes-sigs/cluster-api/blob/v1.7.4/bootstrap/kubeadm/internal/controllers/kubeadmconfig_controller.go#L274-L280

What did you expect to happen?

In the refreshBootstrapTokenIfNeeded function, if the token is not found, a new one should be created instead of just returning an error: https://github.com/kubernetes-sigs/cluster-api/blob/v1.7.4/bootstrap/kubeadm/internal/controllers/kubeadmconfig_controller.go#L326-L329

Cluster API version

v1.7.4

Kubernetes version

v1.27.12

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug

k8s-ci-robot commented 1 month ago

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

archerwu9425 commented 1 month ago

/area bootstrap
