k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

k3s-agent panics and exits uncleanly after network interruptions #10981

Closed PeaceRebel closed 2 weeks ago

PeaceRebel commented 2 weeks ago

Environmental Info: K3s Version: v1.29.4+k3s1

Node(s) CPU architecture, OS, and Version: Linux my-edge-host 6.10.7-200.fc40.aarch64 #1 SMP PREEMPT_DYNAMIC Fri Aug 30 00:37:24 UTC 2024 aarch64 GNU/Linux

Cluster Configuration: 1 server in AWS and 32 agent nodes (a mix of amd64 and aarch64 machines)

Describe the bug: The agent nodes are at the edge and have occasional network interruptions that can last a few hours. k3s-agent keeps trying to reach the server, but at some point it panics and the agent crashes (unclean exit). After this, I see a "failed to get CA certs" error, and the agent does not reconnect to the server once the network is stable. We know that the server is fine because the other edge nodes are healthy.

Steps To Reproduce:

Expected behavior: The agent shouldn't panic, and exits should be clean.

Actual behavior: k3s-agent crashes, and the service has to be restarted before the agent reports back to the server once the network is stable.

Additional context / logs:

Oct 02 18:20:28 my-edge-host k3s[1525]: time="2024-10-02T18:20:28Z" level=info msg="Connecting to proxy" url="wss://<server-ip>:6443/v1-k3s/connect"
Oct 02 18:20:28 my-edge-host k3s[1525]: time="2024-10-02T18:20:28Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp <server-ip>:6443: connect: network is unreachable"
Oct 02 18:20:28 my-edge-host k3s[1525]: time="2024-10-02T18:20:28Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp <server-ip>:6443: connect: network is unreachable" url="wss://<server-ip>:6443/v1-k3s/connect"
Oct 02 18:20:28 my-edge-host k3s[1525]: panic: runtime error: index out of range [2] with length 2
Oct 02 18:20:28 my-edge-host k3s[1525]: goroutine 213962 [running]:
Oct 02 18:20:28 my-edge-host k3s[1525]: github.com/k3s-io/k3s/pkg/agent/loadbalancer.(*LoadBalancer).nextServer(0x40004d3420, {0x4000700648?, 0x4000700648?})
Oct 02 18:20:28 my-edge-host k3s[1525]:         /go/src/github.com/k3s-io/k3s/pkg/agent/loadbalancer/servers.go:131 +0x2a0
Oct 02 18:20:28 my-edge-host k3s[1525]: github.com/k3s-io/k3s/pkg/agent/loadbalancer.(*LoadBalancer).dialContext(0x40004d3420, {0x6a37140, 0x400054eb60?}, {0x5ab6058, 0x3}, {0x0?, 0x0?})
Oct 02 18:20:28 my-edge-host k3s[1525]:         /go/src/github.com/k3s-io/k3s/pkg/agent/loadbalancer/loadbalancer.go:176 +0x248
Oct 02 18:20:28 my-edge-host k3s[1525]: inet.af/tcpproxy.(*DialProxy).HandleConn(0x400066d8c0, {0x6a54130, 0x4002866188})
Oct 02 18:20:28 my-edge-host k3s[1525]:         /go/pkg/mod/inet.af/tcpproxy@v0.0.0-20200125044825-b6bb9b5b8252/tcpproxy.go:359 +0xf0
Oct 02 18:20:28 my-edge-host k3s[1525]: inet.af/tcpproxy.(*Proxy).serveConn(0x50bdd40?, {0x6a54130?, 0x4002866188}, {0x400088d190, 0x1, 0x4000aca060?})
Oct 02 18:20:28 my-edge-host k3s[1525]:         /go/pkg/mod/inet.af/tcpproxy@v0.0.0-20200125044825-b6bb9b5b8252/tcpproxy.go:239 +0x28c
Oct 02 18:20:28 my-edge-host k3s[1525]: created by inet.af/tcpproxy.(*Proxy).serveListener in goroutine 275
Oct 02 18:20:28 my-edge-host k3s[1525]:         /go/pkg/mod/inet.af/tcpproxy@v0.0.0-20200125044825-b6bb9b5b8252/tcpproxy.go:221 +0x40
Oct 02 18:20:28 my-edge-host systemd[1]: k3s-agent.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ An ExecStart= process belonging to unit k3s-agent.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 2.
Oct 02 18:20:28 my-edge-host systemd[1]: k3s-agent.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit k3s-agent.service has entered the 'failed' state with result 'exit-code'.
Oct 02 18:20:28 my-edge-host systemd[1]: k3s-agent.service: Unit process 4511 (containerd-shim) remains running after unit stopped.
Oct 02 18:20:28 my-edge-host systemd[1]: k3s-agent.service: Unit process 4512 (containerd-shim) remains running after unit stopped.
Oct 02 18:20:28 my-edge-host systemd[1]: k3s-agent.service: Consumed 9min 39.280s CPU time, 304.2M memory peak, 0B memory swap peak.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit k3s-agent.service completed and consumed the indicated resources.
Oct 02 18:20:33 my-edge-host systemd[1]: k3s-agent.service: Scheduled restart job, restart counter is at 1.
░░ Subject: Automatic restarting of a unit has been scheduled
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ Automatic restarting of the unit k3s-agent.service has been scheduled, as the result for
░░ the configured Restart= setting for the unit.
Oct 02 18:20:33 my-edge-host systemd[1]: k3s-agent.service: Found left-over process 4511 (containerd-shim) in control group while starting unit. Ignoring.
Oct 02 18:20:33 my-edge-host systemd[1]: k3s-agent.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 02 18:20:33 my-edge-host systemd[1]: k3s-agent.service: Found left-over process 4512 (containerd-shim) in control group while starting unit. Ignoring
Oct 02 18:20:33 my-edge-host systemd[1]: Starting k3s-agent.service - Lightweight Kubernetes...
░░ Subject: A start job for unit k3s-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit k3s-agent.service has begun execution.
░░
░░ The job identifier is 11608.
Oct 02 18:20:33 my-edge-host sh[15540]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Oct 02 18:20:34 my-edge-host systemd[1]: k3s-agent.service: Found left-over process 4511 (containerd-shim) in control group while starting unit. Ignoring.
Oct 02 18:20:34 my-edge-host systemd[1]: k3s-agent.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 02 18:20:34 my-edge-host systemd[1]: k3s-agent.service: Found left-over process 4512 (containerd-shim) in control group while starting unit. Ignoring.
Oct 02 18:20:34 my-edge-host systemd[1]: k3s-agent.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=info msg="Starting k3s agent v1.29.4+k3s1 (94e29e2e)"
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: server.my-aws-server.com:6443"
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: <server-ip>:6443"
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=info msg="Removing server from load balancer k3s-agent-load-balancer: server.my-aws-server.com:6443"
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [<server-ip>:6443] [default: server.my-aws-server.com:6443]"
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:44428->127.0.0.1:6444: read: connection reset by peer"

brandond commented 2 weeks ago

This was fixed in June; please update to a newer release.
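
For readers wondering how a reconnect loop can end in "index out of range [2] with length 2", the following is a minimal sketch of one way that class of panic can arise: a round-robin cursor that is not re-checked after the server list shrinks. It is an illustration only, not the k3s source and not the fix that shipped; the pool type, its fields, and the remove, nextUnsafe, and nextSafe names are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// pool is a hypothetical round-robin server list (not the k3s LoadBalancer type).
type pool struct {
	servers []string
	next    int // index of the server to try next
}

var errNoServers = errors.New("no servers available")

// remove drops addr from the list but leaves p.next untouched,
// so p.next may now point past the end of the shorter slice.
func (p *pool) remove(addr string) {
	kept := p.servers[:0:0] // empty slice with zero capacity, so append allocates
	for _, s := range p.servers {
		if s != addr {
			kept = append(kept, s)
		}
	}
	p.servers = kept
}

// nextUnsafe indexes with the possibly stale cursor and panics if the list shrank.
func (p *pool) nextUnsafe() string {
	s := p.servers[p.next]
	p.next = (p.next + 1) % len(p.servers)
	return s
}

// nextSafe re-validates the cursor against the current length before indexing.
func (p *pool) nextSafe() (string, error) {
	if len(p.servers) == 0 {
		return "", errNoServers
	}
	if p.next >= len(p.servers) {
		p.next = 0 // cursor fell off the end after a removal; wrap around
	}
	s := p.servers[p.next]
	p.next = (p.next + 1) % len(p.servers)
	return s, nil
}

func main() {
	p := &pool{servers: []string{"a:6443", "b:6443", "c:6443"}, next: 2}
	p.remove("c:6443") // list shrinks to length 2 while the cursor is still 2

	// p.nextUnsafe() would panic here; nextSafe wraps the cursor instead.
	s, err := p.nextSafe()
	fmt.Println(s, err)
}
```

In this sketch, calling nextUnsafe after the removal indexes position 2 of a two-element slice, which is the same runtime error shape as in the trace above. Whether the real bug followed this exact path is an assumption; upgrading as advised is the actual remedy.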