rancher / rke2

https://docs.rke2.io/
Apache License 2.0

[Release-1.27] - Agent loadbalancer may deadlock when servers are removed #6324

Closed: brandond closed this issue 2 months ago

brandond commented 3 months ago

Backport fix for Agent loadbalancer may deadlock when servers are removed

aganesh-suse commented 3 months ago

Sorry, I posted the k3s results here and closed the issue by mistake, so I am re-opening it (the k3s results have been deleted). I will update with the RKE2 results and close it next week.

aganesh-suse commented 2 months ago

Validated on release-1.27 branch with version v1.27.16-rc4+rke2r1

Environment Details

Infrastructure

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA: 3 servers / 1 agent

Config.yaml:

token: xxxx
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1

Testing Steps

  1. Copy config.yaml
    $ sudo mkdir -p /etc/rancher/rke2 && sudo cp config.yaml /etc/rancher/rke2
  2. Install RKE2 (on the agent node, use INSTALL_RKE2_TYPE='agent' instead)
    $ curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION='v1.27.16-rc4+rke2r1' INSTALL_RKE2_TYPE='server' INSTALL_RKE2_METHOD=tar sh -
  3. Start the RKE2 service
    On the server nodes:
    $ sudo systemctl enable --now rke2-server
    or, on the agent node:
    $ sudo systemctl enable --now rke2-agent
  4. Verify Cluster Status:
    kubectl get nodes -o wide
    kubectl get pods -A
  5. Refer to the verification steps here: https://github.com/k3s-io/k3s/pull/10511
    a. Identify the server that the agent is connected to:
      $ netstat -na | grep 6443
    b. Disconnect the network on that server (replace eth0 with whichever interface that node is using):
      $ ip link set dev eth0 down
    c. The failed server should get removed from the server list.
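
To make the "identify the server" part of step 5 less manual, the netstat output can be filtered down to the remote servers only. This is a minimal sketch, not part of the original test plan. The function name connected_servers is hypothetical, and it assumes Linux net-tools output where column 5 is the foreign address and column 6 is the state. Loopback entries are skipped on the assumption that they belong to the agent's local load balancer rather than to a server.

```shell
# connected_servers: hypothetical helper, not an RKE2 command.
# Reads `netstat -na` output on stdin and prints each remote address that has
# an ESTABLISHED connection to port 6443, once per address.
# Assumes net-tools column layout: $5 = foreign address, $6 = state.
connected_servers() {
  awk '$6 == "ESTABLISHED" && $5 ~ /:6443$/ && $5 !~ /^127\./ { print $5 }' | sort -u
}
```

Run on the agent node as: netstat -na | connected_servers. IPv6 loopback addresses are not filtered by this sketch.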

Replication Results:

level=error msg="Remotedialer proxy error; reconnecting..." error="websocket: close 1006 (abnormal closure): unexpected EOF" url="wss://<ip1>:9345/v1-rke2/connect"
level=info msg="Closing 1 connections to load balancer server <ip1>:6443"
level=info msg="Connecting to proxy" url="wss://<ip1>:9345/v1-rke2/connect"
level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp <ip1>:9345: connect: connection refused"
.
.
level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: <ip1>:6443"
level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [<ip2>:6443 <ip3>:6443] [default: <ip1>:6443]"
level=info msg="Removing server from load balancer rke2-agent-load-balancer: <ip1>:9345"
level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [<ip2>:9345 <ip3>:9345] [default: <ip1>:9345]"

Validation Results:

level=error msg="Remotedialer proxy error; reconnecting..." error="websocket: close 1006 (abnormal closure): unexpected EOF" url="wss://<ip1>:9345/v1-rke2/connect"
level=info msg="Closing 3 connections to load balancer server <ip1>:6443"
level=debug msg="Failed over to new server for load balancer rke2-api-server-agent-load-balancer: <ip1>:6443 -> <ip3>:6443"
level=info msg="Connecting to proxy" url="wss://<ip1>:9345/v1-rke2/connect"
level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp <ip1>:9345: connect: connection refused"
level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp <ip1>:9345: connect: connection refused" url="wss://<ip1>:9345/v1-rke2/connect"
.
.

level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: <ip1>:6443"
level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [<ip2>:6443 <ip3>:6443] [default: <ip1>:6443]"
level=info msg="Removing server from load balancer rke2-agent-load-balancer: <ip1>:9345"
level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [<ip2>:9345 <ip3>:9345] [default: <ip1>:9345]"
level=info msg="Stopped tunnel to <ip1>:9345"
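
The difference between the two log excerpts above (the validation run shows a "Failed over to new server" line, and the failed server no longer appears in the active server list) can be checked mechanically. The following is a rough sketch under stated assumptions: check_failover is a hypothetical helper, it matches only the message strings quoted in the logs above, and the failover message is logged at debug level, so it will only be found if debug logging is enabled on the agent.

```shell
# check_failover: hypothetical helper, not an RKE2 command.
# Reads agent log lines on stdin; $1 is the IP of the server that was taken down.
# Prints "ok" and returns 0 if the logs show a failover and the failed server
# is absent from every updated server list; otherwise prints a reason and returns 1.
check_failover() {
  ip=$1
  log=$(cat)

  # 1. The load balancer must report failing over to another server
  #    (debug-level message in the validation logs).
  printf '%s\n' "$log" | grep -qF "Failed over to new server for load balancer" || {
    echo "no failover message found"
    return 1
  }

  # 2. Collect the updated server lists, dropping the trailing "[default: ...]"
  #    part, which still names the original server by design.
  lists=$(printf '%s\n' "$log" | grep -F "Updated load balancer" | sed 's/ \[default:.*//')
  [ -n "$lists" ] || {
    echo "no server list update found"
    return 1
  }

  # 3. The failed server must not appear in any remaining list entry.
  if printf '%s\n' "$lists" | grep -qF -e "[$ip:" -e " $ip:"; then
    echo "failed server still in server list"
    return 1
  fi

  echo "ok"
}
```

One possible way to feed it, assuming the agent runs as the rke2-agent systemd unit: journalctl -u rke2-agent --no-pager | check_failover <ip1>.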