k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

[Release-1.29] - Agent loadbalancer may deadlock when servers are removed #10515

Closed brandond closed 3 months ago

brandond commented 3 months ago

Backport fix for Agent loadbalancer may deadlock when servers are removed

aganesh-suse commented 3 months ago

Validated on release-1.29 branch with version v1.29.7-rc1+k3s1

Environment Details

Infrastructure

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA: 3 servers / 1 agent

Config.yaml:

token: xxxx
cluster-init: true
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
node-label:
- k3s-upgrade=server

Testing Steps

  1. Copy config.yaml
    $ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s
  2. Install k3s
    curl -sfL https://get.k3s.io | sudo INSTALL_K3S_VERSION='v1.29.7-rc1+k3s1' sh -s - server
  3. Verify Cluster Status:
    kubectl get nodes -o wide
    kubectl get pods -A
  4. Identify the server that the agent is connected to:
    netstat -na | grep 6443
  5. Disconnect the network on that server (replace eth0 with whatever interface that node is using):
    ip link set dev eth0 down
  6. Check the journal logs for a load balancer update.
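
For step 6, one possible way to narrow the journal down to load balancer messages is sketched below. It assumes the agent runs under the systemd unit `k3s-agent` (adjust if your unit is named differently); the helper name `lb_events` is hypothetical and not part of k3s.

```shell
# Hypothetical helper: keep only the log lines on stdin that mention
# the agent load balancer named in the k3s log messages.
lb_events() {
    grep -E 'load balancer k3s-agent-load-balancer'
}

# Assumed usage on the agent node (the unit name is an assumption):
#   sudo journalctl -u k3s-agent --no-pager | lb_events
```

Reading from stdin keeps the filter usable on a saved log file as well as on live `journalctl` output.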

Replication Results:

level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp <ip1>:6443: connect: connection timed out" url="wss://<ip1>:6443/v1-k3s/connect"
level=info msg="Connecting to proxy" url="wss://<ip1>:6443/v1-k3s/connect"
level=debug msg="Failed over to new server for load balancer k3s-agent-load-balancer: <ip1>:6443 -> <ip2>:6443"

Validation Results:

level=info msg="Removing server from load balancer k3s-agent-load-balancer: <ip1>:6443"
level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [<ip2>:6443 <ip3>:6443] [default: <ip1>:6443]"
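
The two outcomes above differ in whether the server list is rewritten after the failed server is dropped. A minimal sketch of a check for that, assuming the message text matches the validation log shown here; the function name `lb_list_updated` is hypothetical.

```shell
# Hypothetical check: succeed only if the log on stdin contains the
# "Updated load balancer ... server addresses" message for the agent load balancer.
lb_list_updated() {
    grep -q 'Updated load balancer k3s-agent-load-balancer server addresses'
}

# Assumed usage (the unit name is an assumption):
#   sudo journalctl -u k3s-agent --no-pager | lb_list_updated && echo "server list updated"
```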