k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Fail to run static pod without master running #10036

Closed liyimeng closed 2 weeks ago

liyimeng commented 2 weeks ago

Environmental Info:

K3s Version:

1.28.6

Node(s) CPU architecture, OS, and Version:

4 core arm64

Cluster Configuration:

1 master + 1 worker

Describe the bug:

I have a static pod running on each node. Both nodes went down after a power loss. For some reason the master node was not able to recover, but the worker node booted back up as normal. However, the k3s service on the worker got stuck at

time="2024-04-28T02:53:14Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:33176->127.0.0.1:6444: read: connection reset by peer"

so my static pod was not able to come back.

I am wondering whether this is expected behaviour. Would it be possible to bring containerd and the kubelet up first so that the static pod can start, and let the kubelet try to connect back later when the master becomes available?

Steps To Reproduce:

Actual behavior:

The k3s service on the worker node failed to start.

Additional context / logs:


time="2024-04-28T06:07:34Z" level=info msg="Starting k3s agent v1.28.8+k3s-7be8d297 (7be8d297)"
time="2024-04-28T06:07:34Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 192.168.43.98:6443"
time="2024-04-28T06:07:34Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [192.168.43.98:6443] [default: 192.168.43.98:6443]"
time="2024-04-28T06:07:40Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:34606->127.0.0.1:6444: read: connection reset by peer"
time="2024-04-28T06:07:46Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:34610->127.0.0.1:6444: read: connection reset by peer"
time="2024-04-28T06:07:52Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:34616->127.0.0.1:6444: read: connection reset by peer"
time="2024-04-28T06:07:58Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:34622->127.0.0.1:6444: read: connection reset by peer"

liyimeng commented 2 weeks ago

I see we already allow this for etcdagent nodes. Can it be extended to normal agent nodes? Based on resources like this, I would consider this a case where k3s breaks k8s conformance.

brandond commented 2 weeks ago

This is a duplicate of https://github.com/k3s-io/k3s/issues/1686

K3s agents do not start the container runtime or kubelet until a server is reachable to provide up-to-date certificates and configuration. Because the kubelet isn't started yet, there is no issue with Kubernetes conformance - conformance places no requirements on how a distribution operates prior to startup of Kubernetes itself, or what startup dependencies a distribution enforces.

Although there have been discussions on this topic in the past, we are not planning to relax this requirement at this time.
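
For readers trying to picture the startup dependency described above, here is a minimal Python sketch of a "block until the server answers" loop that would produce log lines like the ones in the report. It is an illustration only, not the K3s implementation (K3s is written in Go): the names `wait_for_ca_certs` and `make_flaky_fetch`, the default retry interval, and the `fake-ca-bundle` payload are all hypothetical. The point is that nothing after the call (container runtime, kubelet, static pods) can run until the fetch succeeds.

```python
import time
from typing import Callable, Optional


def wait_for_ca_certs(
    fetch: Callable[[], bytes],
    retry_interval: float = 5.0,  # hypothetical default, not taken from K3s
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: Optional[int] = None,  # None means retry forever
) -> bytes:
    """Call fetch() until it succeeds and return the CA bundle it yields.

    Connection errors (ConnectionResetError, URLError, ...) are all
    subclasses of OSError, so that is what gets retried here.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return fetch()
        except OSError as err:
            print(f"failed to get CA certs: {err}")
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(retry_interval)


def make_flaky_fetch(failures: int) -> Callable[[], bytes]:
    """Build a stand-in for the server: fail `failures` times, then succeed."""
    remaining = {"count": failures}

    def fetch() -> bytes:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ConnectionResetError("read: connection reset by peer")
        return b"fake-ca-bundle"  # placeholder payload, not a real certificate

    return fetch


if __name__ == "__main__":
    # The server "comes back" after three failed attempts; only then would
    # the agent go on to start its container runtime and kubelet.
    certs = wait_for_ca_certs(make_flaky_fetch(3), retry_interval=0.1)
    print(f"got {len(certs)} bytes of CA data, continuing startup")
```

With `max_attempts=None` the loop never gives up, which corresponds to the behaviour in the report: as long as the server stays unreachable, the agent keeps logging the same error and the static pod is never started.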