rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Losing 2 nodes in 3 node cluster seems to take down the remaining node #6182

Closed: joelcomp1 closed this issue 5 months ago

joelcomp1 commented 5 months ago

Environmental Info: RKE2 Version: rke2 version v1.29.5+rke2r1 (d8f98fd9d7f5b2a7d6a754850ba942b5247fd50b) go version go1.21.9 X:boringcrypto

Node(s) CPU architecture, OS, and Version: Linux 4.18.0-553.5.1.el8_10.x86_64 #1 SMP Tue May 21 03:13:04 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux; Red Hat Enterprise Linux release 8.10 (Ootpa); SELinux and FIPS enabled

Cluster Configuration: 3 nodes, all set up as servers (no worker-only nodes)
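For context, a minimal sketch of how a three-server cluster like this is typically brought up (the hostname and token below are placeholders, not values from this report; the config keys are the standard RKE2 server options):

# On the first server: write a config with a shared token, then start RKE2.
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml <<'EOF'
token: <shared-secret>
EOF
systemctl enable --now rke2-server

# On the second and third servers: point at the first server's supervisor
# port (9345) with the same token, then start RKE2.
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml <<'EOF'
server: https://server-1.example.com:9345
token: <shared-secret>
EOF
systemctl enable --now rke2-server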

Describe the bug: We are trying to test failover with our cluster to ensure we can run with only one server. We have found, though, that when we shut off the other two nodes, the third, remaining node won't start up.

Steps To Reproduce:
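The steps are implied by the description above; roughly (a sketch, assuming the stock rke2-server systemd units):

# 1. Bring up a three-node cluster where every node runs rke2-server.
# 2. On two of the three servers, stop the node (or power it off):
systemctl stop rke2-server
# 3. On the remaining server, try the API; it no longer responds:
kubectl get nodes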

Expected behavior: After losing two of the three nodes, the cluster should remain up.

Actual behavior: We lose access to the Kubernetes API. If we run kubectl commands, e.g. kubectl get nodes, we get:

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)

DEBU[0254] Tunnel server egress proxy dialing 10.42.1.42:10250 directly
DEBU[0254] Tunnel server handing HTTP/1.1 CONNECT request for //10.42.1.42:10250 from 127.0.0.1:34532
DEBU[0254] Tunnel server egress proxy dialing 10.42.1.42:10250 directly
DEBU[0254] Tunnel server handing HTTP/1.1 CONNECT request for //10.42.1.42:10250 from 127.0.0.1:34536
DEBU[0254] Tunnel server egress proxy dialing 10.42.1.42:10250 directly
{"level":"warn","ts":"2024-06-11T15:46:18.768728-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ad4c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
INFO[0255] Failed to test data store connection: context deadline exceeded
INFO[0255] Waiting for etcd server to become available
INFO[0255] Waiting for API server to become available
INFO[0255] Pod for etcd not synced (pod sandbox has changed), retrying
DEBU[0257] Wrote ping
{"level":"warn","ts":"2024-06-11T15:46:21.172563-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ad4c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
ERRO[0257] Failed to check local etcd status for learner management: context deadline exceeded
INFO[0257] Cluster-Http-Server 2024/06/11 15:46:21 http: TLS handshake error from 127.0.0.1:34580: remote error: tls: bad certificate
INFO[0257] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
DEBU[0262] Wrote ping
INFO[0262] Cluster-Http-Server 2024/06/11 15:46:26 http: TLS handshake error from 127.0.0.1:46036: remote error: tls: bad certificate
INFO[0262] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
DEBU[0267] Wrote ping
INFO[0267] Cluster-Http-Server 2024/06/11 15:46:31 http: TLS handshake error from 127.0.0.1:46076: remote error: tls: bad certificate
INFO[0267] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
DEBU[0272] Wrote ping
{"level":"warn","ts":"2024-06-11T15:46:36.173061-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ad4c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
ERRO[0272] Failed to check local etcd status for learner management: context deadline exceeded
INFO[0272] Cluster-Http-Server 2024/06/11 15:46:36 http: TLS handshake error from 127.0.0.1:59026: remote error: tls: bad certificate
INFO[0272] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
INFO[0275] Pod for etcd not synced (pod sandbox has changed), retrying
DEBU[0277] Wrote ping
INFO[0277] Cluster-Http-Server 2024/06/11 15:46:41 http: TLS handshake error from 127.0.0.1:59042: remote error: tls: bad certificate
INFO[0277] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
DEBU[0282] Wrote ping
INFO[0282] Cluster-Http-Server 2024/06/11 15:46:46 http: TLS handshake error from 127.0.0.1:51428: remote error: tls: bad certificate
INFO[0282] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
DEBU[0283] Tunnel server handing HTTP/1.1 CONNECT request for //10.42.1.42:10250 from 127.0.0.1:51450
DEBU[0283] Tunnel server egress proxy dialing 10.42.1.42:10250 directly
INFO[0285] Waiting for API server to become available
INFO[0285] Waiting for etcd server to become available
DEBU[0287] Wrote ping
{"level":"warn","ts":"2024-06-11T15:46:51.173241-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ad4c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
ERRO[0287] Failed to check local etcd status for learner management: context deadline exceeded
INFO[0287] Cluster-Http-Server 2024/06/11 15:46:51 http: TLS handshake error from 127.0.0.1:51458: remote error: tls: bad certificate
INFO[0287] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
{"level":"warn","ts":"2024-06-11T15:46:53.770483-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ad4c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
INFO[0290] Failed to test data store connection: context deadline exceeded
DEBU[0292] Wrote ping
INFO[0292] Cluster-Http-Server 2024/06/11 15:46:56 http: TLS handshake error from 127.0.0.1:34204: remote error: tls: bad certificate
INFO[0292] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
ERRO[0294] Sending HTTP 502 response to 127.0.0.1:60092: dial tcp 10.42.1.42:10250: connect: connection timed out
INFO[0295] Pod for etcd not synced (pod sandbox has changed), retrying

Additional context / logs: The log output above is what I see if I run the following on the one remaining node:

/usr/bin/rke2 server --debug
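Two quick ways to poke at the same failure from the surviving node (a sketch only; the certificate paths below assume the default RKE2 data directory layout and are not taken from this report):

# Follow the supervisor logs without running rke2 in the foreground:
journalctl -u rke2-server -f

# Query the local etcd member's health endpoint directly; while quorum is
# lost, expect the connection to 127.0.0.1:2379 to be refused or unhealthy:
curl --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
     --cert   /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
     --key    /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
     https://127.0.0.1:2379/health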

brandond commented 5 months ago

We are trying to test failover with our cluster to ensure we can run with only one server,

You can't. This is not how etcd or any other distributed datastore works. Your cluster needs to have a quorum of etcd nodes available at all times in order to function.

https://docs.rke2.io/install/ha

An odd number (three recommended) of server nodes that will run etcd, the Kubernetes API, and other control plane services

Why An Odd Number Of Server Nodes?

An etcd cluster must be comprised of an odd number of server nodes for etcd to maintain quorum. For a cluster with n servers, quorum is (n/2)+1. For any odd-sized cluster, adding one node will always increase the number of nodes necessary for quorum. Although adding a node to an odd-sized cluster appears better since there are more machines, the fault tolerance is worse. Exactly the same number of nodes can fail without losing quorum, but there are now more nodes that can fail.
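Concretely, with quorum = floor(n/2) + 1, the failure tolerance works out as follows (this report is the n = 3 row):

n = 1: quorum 1, tolerates 0 failed servers
n = 2: quorum 2, tolerates 0 failed servers
n = 3: quorum 2, tolerates 1 failed server
n = 4: quorum 3, tolerates 1 failed server
n = 5: quorum 3, tolerates 2 failed servers

So a three-server cluster survives losing one server. Losing two drops etcd below quorum, the local etcd member stops serving on 127.0.0.1:2379, and the API server on the surviving node can no longer reach its datastore, which is exactly what the log output above shows.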

See also https://etcd.io/docs/v3.5/faq/#what-is-failure-tolerance
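For completeness (not part of the original exchange): if the two lost servers are genuinely gone and cannot be brought back, the usual way to get the lone survivor serving again is to reset its etcd membership to a single-node cluster rather than waiting for quorum to return. A rough sketch, assuming the standard rke2 --cluster-reset flow:

# On the surviving server, stop the service, reset etcd to a one-member
# cluster, then start the service again and re-join replacement servers:
systemctl stop rke2-server
rke2 server --cluster-reset
systemctl start rke2-server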