rancher / rke2

https://docs.rke2.io/
Apache License 2.0
1.55k stars 268 forks source link

RKE2 agent PANIC when controlplane quorum is lost since several hours #6130

Closed sebastienmusso closed 4 months ago

sebastienmusso commented 4 months ago

Environmental Info: RKE2 Version: v1.28.9+rke2r1

Node(s) CPU architecture, OS, and Version: Linux 5.15.0-102-generic Ubuntu

Cluster Configuration: 3 servers 4 agent

Describe the bug: We are trying to check the behaviour of our application when control plane quorum is lost. We stopped 2 controlplane out of the 3 and wait some hours. At the beginning ingress and deployed application still available, but some hours later every connection hanged and we could not reach app within ingress. On all agent nodes we got this golang panic:

panic: runtime error: index out of range [2] with length 2
goroutine 57316 [running]:
github.com/k3s-io/k3s/pkg/agent/loadbalancer.(*LoadBalancer).nextServer(0xc000ece0e0, {0xc000709020?, 0xc000709020?})
#011/go/pkg/mod/github.com/k3s-io/k3s@v1.28.9-rc1.0.20240423023449-289a1a3edbc0/pkg/agent/loadbalancer/servers.go:131 +0x2c5
github.com/k3s-io/k3s/pkg/agent/loadbalancer.(*LoadBalancer).dialContext(0xc000ece0e0, {0x46a46a8, 0xc0007dc1c0?}, {0x3e1b300, 0x3}, {0x0?, 0x2540be400?})
#011/go/pkg/mod/github.com/k3s-io/k3s@v1.28.9-rc1.0.20240423023449-289a1a3edbc0/pkg/agent/loadbalancer/loadbalancer.go:176 +0x2f3
inet.af/tcpproxy.(*DialProxy).HandleConn(0xc000f2e500, {0x46b7e10, 0xc000a83478})
#011/go/pkg/mod/inet.af/tcpproxy@v0.0.0-20200125044825-b6bb9b5b8252/tcpproxy.go:359 +0xff
inet.af/tcpproxy.(*Proxy).serveConn(0x4480dc?, {0x46b7e10?, 0xc000a83478}, {0xc000eda780, 0x1, 0x0?})
#011/go/pkg/mod/inet.af/tcpproxy@v0.0.0-20200125044825-b6bb9b5b8252/tcpproxy.go:239 +0x30b
created by inet.af/tcpproxy.(*Proxy).serveListener in goroutine 281
#011/go/pkg/mod/inet.af/tcpproxy@v0.0.0-20200125044825-b6bb9b5b8252/tcpproxy.go:221 +0x45

With this panic, containerd and kubelet wich are started by rke2 agent process have been killed too and container were unavailable. This message happen on all agent nodes but at different times (few hours between them) I know controlplane quorum should not be lost as long, but what if.

Steps To Reproduce: Stop 2 server nodes and let cluster work

Expected behavior: rke agent process should continue running or maybe kill himself with security message/error

Actual behavior: panic from rke agent process

brandond commented 4 months ago

It sounds like there's a race condition somewhere in the load-balancer retry logic, and that should be fixable. If you can run the agent with debug: true and attach the full log from journald, that would be helpful.

brandond commented 4 months ago

There is a read lock held within the faulting function, so we must have missed holding the correct lock somewhere that's modifying the randomServers array.

https://github.com/k3s-io/k3s/blob/v1.28.9%2Bk3s1/pkg/agent/loadbalancer/servers.go#L114-L135

edit: nevermind, I see that this function holds a read lock on the list, but writes to the nextServerIndex field. Multiple goroutines could increment that at the same time, after its value has been clamped to the length of the array.

VestigeJ commented 4 months ago

I'm currently unable to reproduce this after leaving multiple server nodes powered off for 4 and then a 20 hour stretch. I can see the agent node correctly updating loadbalancer server lists without issue from the agent.

I'm not saying I don't believe this happened but that I won't be able to really reproduce and validate the fix.

$ kgn

NAME        STATUS   ROLES                       AGE   VERSION
age0        Ready    <none>                      22h   v1.28.9+rke2r1
server0     Ready    control-plane,etcd,master   23h   v1.28.9+rke2r1
ser1        Ready    control-plane,etcd,master   22h   v1.28.9+rke2r1
ser2        Ready    control-plane,etcd,master   22h   v1.28.9+rke2r1

$ sudo journalctl -u rke2-agent

age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp server0:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp server0:9345: connect: connection refused" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp server0:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp server0:9345: connect: connection refused" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp server0:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp server0:9345: connect: connection refused" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp server0:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp server0:9345: connect: connection refused" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp server0:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp server0:9345: connect: connection refused" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser1:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser1:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: ser1:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: ser1:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Stopped tunnel to ser1:9345"
age0 rke2[1492]: level=info msg="Proxy done" err="context canceled" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser1:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser1:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Stopped tunnel to ser2:9345"
age0 rke2[1492]: level=info msg="Proxy done" err="context canceled" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Stopped tunnel to ser2:9345"
age0 rke2[1492]: level=info msg="Proxy done" err="context canceled" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="websocket: close 1006 (abnormal closure): unexpected EOF" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp ser2:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp ser2:9345: connect: connection refused" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp ser2:9345: connect: connection refused"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp ser2:9345: connect: connection refused" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=error msg="Remotedialer proxy error; reconecting..." error="read tcp agent0:59264->server0:9345: read: connection reset by peer" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: server0:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: server0:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Stopped tunnel to ser2:9345"
age0 rke2[1492]: level=info msg="Stopped tunnel to server0:9345"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser2:6443"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: server0:6443"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-api-server-agent-load-balancer: ser1:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser2:9345"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: server0:9345"
age0 rke2[1492]: level=info msg="Removing server from load balancer rke2-agent-load-balancer: ser1:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Stopped tunnel to ser1:9345"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Proxy done" err="context canceled" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://server0:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser2:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: ser1:6443"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [ser1:6443 ser2:6443 server0:6443] [default: server0:6443]"
age0 rke2[1492]: level=info msg="Adding server to load balancer rke2-agent-load-balancer: ser1:9345"
age0 rke2[1492]: level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [ser1:9345 ser2:9345 server0:9345] [default: ser2:9345]"
age0 rke2[1492]: level=info msg="Connecting to proxy" url="wss://ser1:9345/v1-rke2/connect"
age0 rke2[1492]: level=info msg="Remotedialer connected to proxy" url="wss://ser1:9345/v1-rke2/connect"

// you can see a lot of restarts from the two nodes being down in the cluster $ kubectl get po -A

NAMESPACE     NAME                                                   READY   STATUS      RESTARTS        AGE
default       akri-agent-daemonset-kdb7m                             1/1     Running     2 (13m ago)     23h
default       akri-agent-daemonset-mrwqc                             1/1     Running     0               23h
default       akri-agent-daemonset-qkkfx                             1/1     Running     1 (13m ago)     23h
default       akri-agent-daemonset-sfnmx                             1/1     Running     1 (18h ago)     23h
default       akri-controller-deployment-75cfdfb45-xmnsg             1/1     Running     4 (11m ago)     23h
default       akri-webhook-configuration-779bc4b7b7-jlhzc            1/1     Running     0               23h
kube-system   cloud-controller-manager-localhost                     1/1     Running     87 (12m ago)    21h
kube-system   cloud-controller-manager-ser1                          1/1     Running     5 (18h ago)     23h
kube-system   cloud-controller-manager-ser2                          1/1     Running     8 (12m ago)     23h
kube-system   etcd-localhost                                         1/1     Running     1               21h
kube-system   etcd-ser1                                              1/1     Running     2               23h
kube-system   etcd-ser2                                              1/1     Running     4               23h
kube-system   helm-install-rke2-canal-4kzqj                          0/1     Completed   0               23h
kube-system   helm-install-rke2-coredns-xhbbm                        0/1     Completed   0               23h
kube-system   helm-install-rke2-ingress-nginx-726ft                  0/1     Completed   0               23h
kube-system   helm-install-rke2-metrics-server-972hw                 0/1     Completed   0               23h
kube-system   helm-install-rke2-snapshot-controller-crd-6rkwg        0/1     Completed   0               23h
kube-system   helm-install-rke2-snapshot-controller-f5slk            0/1     Completed   1               23h
kube-system   helm-install-rke2-snapshot-validation-webhook-tx8vw    0/1     Completed   0               23h
kube-system   kube-apiserver-localhost                               1/1     Running     1               21h
kube-system   kube-apiserver-ser1                                    1/1     Running     220 (16m ago)   23h
kube-system   kube-apiserver-ser2                                    1/1     Running     6               23h
kube-system   kube-controller-manager-localhost                      1/1     Running     87 (12m ago)    21h
kube-system   kube-controller-manager-ser1                           1/1     Running     5 (18h ago)     23h
kube-system   kube-controller-manager-ser2                           1/1     Running     7 (12m ago)     23h
kube-system   kube-proxy-age0                                        1/1     Running     0               23h
kube-system   kube-proxy-localhost                                   1/1     Running     1 (13m ago)     21h
kube-system   kube-proxy-ser1                                        1/1     Running     1 (18h ago)     21h
kube-system   kube-proxy-ser2                                        1/1     Running     2 (13m ago)     21h
kube-system   kube-scheduler-localhost                               1/1     Running     3 (13m ago)     21h
kube-system   kube-scheduler-ser1                                    1/1     Running     3 (18h ago)     23h
kube-system   kube-scheduler-ser2                                    1/1     Running     3 (13m ago)     23h
kube-system   rke2-canal-48ds4                                       2/2     Running     0               23h
kube-system   rke2-canal-g74sg                                       2/2     Running     4 (13m ago)     23h
kube-system   rke2-canal-qwzg7                                       2/2     Running     2 (18h ago)     23h
kube-system   rke2-canal-wxnfl                                       2/2     Running     2 (13m ago)     23h
kube-system   rke2-coredns-rke2-coredns-84b9cb946c-5zmq4             1/1     Running     0               22h
kube-system   rke2-coredns-rke2-coredns-84b9cb946c-p9j7f             1/1     Running     1 (13m ago)     23h
kube-system   rke2-coredns-rke2-coredns-autoscaler-b49765765-dg6xm   1/1     Running     1 (13m ago)     23h
kube-system   rke2-ingress-nginx-controller-fc4jq                    1/1     Running     28 (13m ago)    23h
kube-system   rke2-ingress-nginx-controller-g847x                    1/1     Running     2 (13m ago)     23h
kube-system   rke2-ingress-nginx-controller-tbcvt                    1/1     Running     0               23h
kube-system   rke2-ingress-nginx-controller-xrnbk                    1/1     Running     1 (18h ago)     23h
kube-system   rke2-metrics-server-655477f655-h8czb                   1/1     Running     33 (13m ago)    23h
kube-system   rke2-snapshot-controller-59cc9cd8f4-mzkmg              1/1     Running     25 (13m ago)    23h
kube-system   rke2-snapshot-validation-webhook-54c5989b65-l75c6      1/1     Running     33 (13m ago)    23h
VestigeJ commented 4 months ago

@sebastienmusso We've elected to close this because of it's inability to be reliably reproduced. If you find that the behavior is not resolved on the releases that will be out hopefully in the next week or so please feel free to re-create an issue with what you find.

Thank you for reporting your bug!