k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

control-plane-only nodes cannot start if join server is unreachable #11333

Open brandond opened 5 days ago

brandond commented 5 days ago

Environmental Info: K3s Version: n/a

Node(s) CPU architecture, OS, and Version: n/a

Cluster Configuration: cluster with dedicated etcd-only and control-plane-only servers

Describe the bug: Control-plane-only nodes are uniquely disadvantaged at startup: they fail to start if the configured server address is unreachable. Etcd nodes can always start up, as they can reconcile against their local datastore. Agents run a local load-balancer that caches server addresses across restarts, so they can always start up as long as at least one of the previously-known servers is up.

Steps To Reproduce:

  1. Create a cluster with 3 etcd, 2 control-plane, 1 agent. Use the first etcd node as the --server address for the other nodes.
  2. Stop k3s on the first etcd node.
  3. Note that etcd nodes can be restarted successfully, and agents can be restarted successfully, but control-plane nodes cannot. They fail with: level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: Get \"https://ETCD-SERVER-1:6443/cacerts\": dial tcp ETCD-SERVER-1:6443: connect: connection refused"

Expected behavior: All nodes can be restarted as long as at least one previously-known server is available.

Actual behavior: Control-plane-only nodes fail to start if their join server is unavailable.

Additional context / logs:

Our docs say to use a fixed registration address that is backed by multiple servers, but that guidance is not always followed.

All other configurations are able to start up as long as at least one previously-discovered server is available; this should work for control-plane nodes as well.