Closed sebastienmusso closed 4 months ago
It sounds like there's a race condition somewhere in the load-balancer retry logic, and that should be fixable. If you can run the agent with debug: true
and attach the full log from journald, that would be helpful.
There is a read lock held within the faulting function, so we must have missed holding the correct lock somewhere that's modifying the randomServers array.
https://github.com/k3s-io/k3s/blob/v1.28.9%2Bk3s1/pkg/agent/loadbalancer/servers.go#L114-L135
edit: nevermind, I see that this function holds a read lock on the list, but writes to the nextServerIndex field. Multiple goroutines could increment that at the same time, after its value has been clamped to the length of the array.
I'm currently unable to reproduce this after leaving multiple server nodes powered off for 4 and then a 20 hour stretch. I can see the agent node correctly updating loadbalancer server lists without issue from the agent.
I'm not saying I don't believe this happened but that I won't be able to really reproduce and validate the fix.
$ kgn
NAME STATUS ROLES AGE VERSION
age0 Ready <none> 22h v1.28.9+rke2r1
server0 Ready control-plane,etcd,master 23h v1.28.9+rke2r1
ser1 Ready control-plane,etcd,master 22h v1.28.9+rke2r1
ser2 Ready control-plane,etcd,master 22h v1.28.9+rke2r1
$ sudo journalctl -u rke2-agent
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp server0:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp server0:9345: connect: connection refused" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp server0:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp server0:9345: connect: connection refused" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp server0:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp server0:9345: connect: connection refused" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp server0:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp server0:9345: connect: connection refused" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp server0:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp server0:9345: connect: connection refused" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser1:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser1:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: ser1:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: ser1:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Stopped tunnel to ser1:9345"
age0 rke2[1492]: level=info msg="Proxy done" err="context canceled" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser1:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser1:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Stopped tunnel to ser2:9345"
age0 rke2[1492]: level=info msg="Proxy done" err="context canceled" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Stopped tunnel to ser2:9345"
age0 rke2[1492]: level=info msg="Proxy done" err="context canceled" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="websocket: close 1006 (abnormal closure): unexpected EOF" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp ser2:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp ser2:9345: connect: connection refused" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp ser2:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp ser2:9345: connect: connection refused" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="read tcp agent0:59264->server0:9345: read: connection reset by peer" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: server0:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: server0:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Stopped tunnel to ser2:9345"
age0 rke2[1492]: level=info msg="Stopped tunnel to server0:9345"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: server0:6443"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: ser1:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: server0:9345"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: ser1:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Stopped tunnel to ser1:9345"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Proxy done" err="context canceled" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser1:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser1:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser1:9345/v1-rke2/connect"
// you can see a lot of restarts from the two nodes being down in the cluster $ kubectl get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default akri-agent-daemonset-kdb7m 1/1 Running 2 (13m ago) 23h
default akri-agent-daemonset-mrwqc 1/1 Running 0 23h
default akri-agent-daemonset-qkkfx 1/1 Running 1 (13m ago) 23h
default akri-agent-daemonset-sfnmx 1/1 Running 1 (18h ago) 23h
default akri-controller-deployment-75cfdfb45-xmnsg 1/1 Running 4 (11m ago) 23h
default akri-webhook-configuration-779bc4b7b7-jlhzc 1/1 Running 0 23h
kube-system cloud-controller-manager-localhost 1/1 Running 87 (12m ago) 21h
kube-system cloud-controller-manager-ser1 1/1 Running 5 (18h ago) 23h
kube-system cloud-controller-manager-ser2 1/1 Running 8 (12m ago) 23h
kube-system etcd-localhost 1/1 Running 1 21h
kube-system etcd-ser1 1/1 Running 2 23h
kube-system etcd-ser2 1/1 Running 4 23h
kube-system helm-install-rke2-canal-4kzqj 0/1 Completed 0 23h
kube-system helm-install-rke2-coredns-xhbbm 0/1 Completed 0 23h
kube-system helm-install-rke2-ingress-nginx-726ft 0/1 Completed 0 23h
kube-system helm-install-rke2-metrics-server-972hw 0/1 Completed 0 23h
kube-system helm-install-rke2-snapshot-controller-crd-6rkwg 0/1 Completed 0 23h
kube-system helm-install-rke2-snapshot-controller-f5slk 0/1 Completed 1 23h
kube-system helm-install-rke2-snapshot-validation-webhook-tx8vw 0/1 Completed 0 23h
kube-system kube-apiserver-localhost 1/1 Running 1 21h
kube-system kube-apiserver-ser1 1/1 Running 220 (16m ago) 23h
kube-system kube-apiserver-ser2 1/1 Running 6 23h
kube-system kube-controller-manager-localhost 1/1 Running 87 (12m ago) 21h
kube-system kube-controller-manager-ser1 1/1 Running 5 (18h ago) 23h
kube-system kube-controller-manager-ser2 1/1 Running 7 (12m ago) 23h
kube-system kube-proxy-age0 1/1 Running 0 23h
kube-system kube-proxy-localhost 1/1 Running 1 (13m ago) 21h
kube-system kube-proxy-ser1 1/1 Running 1 (18h ago) 21h
kube-system kube-proxy-ser2 1/1 Running 2 (13m ago) 21h
kube-system kube-scheduler-localhost 1/1 Running 3 (13m ago) 21h
kube-system kube-scheduler-ser1 1/1 Running 3 (18h ago) 23h
kube-system kube-scheduler-ser2 1/1 Running 3 (13m ago) 23h
kube-system rke2-canal-48ds4 2/2 Running 0 23h
kube-system rke2-canal-g74sg 2/2 Running 4 (13m ago) 23h
kube-system rke2-canal-qwzg7 2/2 Running 2 (18h ago) 23h
kube-system rke2-canal-wxnfl 2/2 Running 2 (13m ago) 23h
kube-system rke2-coredns-rke2-coredns-84b9cb946c-5zmq4 1/1 Running 0 22h
kube-system rke2-coredns-rke2-coredns-84b9cb946c-p9j7f 1/1 Running 1 (13m ago) 23h
kube-system rke2-coredns-rke2-coredns-autoscaler-b49765765-dg6xm 1/1 Running 1 (13m ago) 23h
kube-system rke2-ingress-nginx-controller-fc4jq 1/1 Running 28 (13m ago) 23h
kube-system rke2-ingress-nginx-controller-g847x 1/1 Running 2 (13m ago) 23h
kube-system rke2-ingress-nginx-controller-tbcvt 1/1 Running 0 23h
kube-system rke2-ingress-nginx-controller-xrnbk 1/1 Running 1 (18h ago) 23h
kube-system rke2-metrics-server-655477f655-h8czb 1/1 Running 33 (13m ago) 23h
kube-system rke2-snapshot-controller-59cc9cd8f4-mzkmg 1/1 Running 25 (13m ago) 23h
kube-system rke2-snapshot-validation-webhook-54c5989b65-l75c6 1/1 Running 33 (13m ago) 23h
@sebastienmusso We've elected to close this because of it's inability to be reliably reproduced. If you find that the behavior is not resolved on the releases that will be out hopefully in the next week or so please feel free to re-create an issue with what you find.
Thank you for reporting your bug!
Environmental Info: RKE2 Version: v1.28.9+rke2r1
Node(s) CPU architecture, OS, and Version: Linux 5.15.0-102-generic Ubuntu
Cluster Configuration: 3 servers 4 agent
Describe the bug: We are trying to check the behaviour of our application when control plane quorum is lost. We stopped 2 controlplane out of the 3 and wait some hours. At the beginning ingress and deployed application still available, but some hours later every connection hanged and we could not reach app within ingress. On all agent nodes we got this golang panic:
With this panic, containerd and kubelet wich are started by rke2 agent process have been killed too and container were unavailable. This message happen on all agent nodes but at different times (few hours between them) I know controlplane quorum should not be lost as long, but what if.
Steps To Reproduce: Stop 2 server nodes and let cluster work
Expected behavior: rke agent process should continue running or maybe kill himself with security message/error
Actual behavior: panic from rke agent process