kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Kubeadm control plane controller logs filled with timeout error #2408

Closed maelk closed 4 years ago

maelk commented 4 years ago

What steps did you take and what happened: When a cluster is deployed with CAPBM, the infrastructure provider does not create an external load balancer. When the KubeadmControlPlane controller creates the first machine, it cannot create a client for the target cluster and fails with:

failed to create remote cluster client: failed to create client for Cluster default/test1: Get https://192.168.111.249:6443/api?timeout=32s: dial tcp 192.168.111.249:6443: i/o timeout

The controller expects a 404 when the load balancer is already set up, but it does not expect a timeout, which is what it gets when the control plane load balancer is self-hosted. Hence the failure. The KCP is also not updated because of the failure.

What did you expect to happen: Ideally, the control plane controller would tolerate a timeout and update the KCP status without filling the logs with errors.

Anything else you would like to add: This can be done by adding !apierrors.IsTimeout(errors.Cause(err)) to the conditions at https://github.com/kubernetes-sigs/cluster-api/blob/master/controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go#L300 and https://github.com/kubernetes-sigs/cluster-api/blob/master/controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go#L787

However, a timeout should perhaps still be treated as an error when an external load balancer is deployed, so could a field be added to the KCP CRD to make this configurable?

Environment:

/kind bug

maelk commented 4 years ago

Would adding a field to configure whether the controller expects a timeout be an OK solution? I can implement it if so. Otherwise, is there another solution that does not require managing a very short-lived load balancer until the first control plane node is ready?

detiber commented 4 years ago

I added some comments to the PR you created. If we can address those, I think adding a check for the timeout error is fine. One thing to keep in mind, though, is that a stable endpoint will be required, so if the migration to the self-hosted load balancer changes that endpoint, it will introduce additional issues during the lifecycle of the control plane.

detiber commented 4 years ago

/priority important-soon
/assign @maelk

detiber commented 4 years ago

/lifecycle active