rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Losing 2 nodes in 3 node cluster seems to take down the remaining node #6182

Closed: joelcomp1 closed this issue 5 months ago

joelcomp1 commented 5 months ago

Environmental Info: RKE2 Version: rke2 version v1.29.5+rke2r1 (d8f98fd9d7f5b2a7d6a754850ba942b5247fd50b) go version go1.21.9 X:boringcrypto

Node(s) CPU architecture, OS, and Version: Linux 4.18.0-553.5.1.el8_10.x86_64 #1 SMP Tue May 21 03:13:04 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux; Red Hat Enterprise Linux release 8.10 (Ootpa); SELinux and FIPS enabled

Cluster Configuration: 3 nodes, all set up as servers (no worker-only nodes)
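For context, a minimal sketch of how a three-server cluster like this is typically brought up (the hostname and token below are placeholders, not values from this report; the config keys are the standard RKE2 server options):

# On the first server: write a config with a shared token, then start RKE2.
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml <<'EOF'
token: <shared-secret>
EOF
systemctl enable --now rke2-server

# On the second and third servers: point at the first server's supervisor
# port (9345) with the same token, then start RKE2.
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml <<'EOF'
server: https://server-1.example.com:9345
token: <shared-secret>
EOF
systemctl enable --now rke2-server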

Describe the bug: We are trying to test failover with our cluster to ensure we can run with only one server. We have found, though, that when we shut off the other two nodes, the third, remaining node won't start up.

Steps To Reproduce:
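The steps are implied by the description above; roughly (a sketch, assuming the stock rke2-server systemd units):

# 1. Bring up a three-node cluster where every node runs rke2-server.
# 2. On two of the three servers, stop the node (or power it off):
systemctl stop rke2-server
# 3. On the remaining server, try the API; it no longer responds:
kubectl get nodes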

Expected behavior: After losing two of the three nodes, the cluster should remain up.

Actual behavior: We lose access to the Kubernetes API. If we run kubectl commands, e.g. kubectl get nodes, we get:

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)

DEBU[0254] Tunnel server egress proxy dialing 10.42.1.42:10250 directly
DEBU[0254] Tunnel server handing HTTP/1.1 CONNECT request for //10.42.1.42:10250 from 127.0.0.1:34532
DEBU[0254] Tunnel server egress proxy dialing 10.42.1.42:10250 directly
DEBU[0254] Tunnel server handing HTTP/1.1 CONNECT request for //10.42.1.42:10250 from 127.0.0.1:34536
DEBU[0254] Tunnel server egress proxy dialing 10.42.1.42:10250 directly
{"level":"warn","ts":"2024-06-11T15:46:18.768728-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ad4c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
INFO[0255] Failed to test data store connection: context deadline exceeded
INFO[0255] Waiting for etcd server to become available
INFO[0255] Waiting for API server to become available
INFO[0255] Pod for etcd not synced (pod sandbox has changed), retrying
DEBU[0257] Wrote ping
{"level":"warn","ts":"2024-06-11T15:46:21.172563-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ad4c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
ERRO[0257] Failed to check local etcd status for learner management: context deadline exceeded
INFO[0257] Cluster-Http-Server 2024/06/11 15:46:21 http: TLS handshake error from 127.0.0.1:34580: remote error: tls: bad certificate
INFO[0257] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
DEBU[0262] Wrote ping
INFO[0262] Cluster-Http-Server 2024/06/11 15:46:26 http: TLS handshake error from 127.0.0.1:46036: remote error: tls: bad certificate
INFO[0262] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
DEBU[0267] Wrote ping
INFO[0267] Cluster-Http-Server 2024/06/11 15:46:31 http: TLS handshake error from 127.0.0.1:46076: remote error: tls: bad certificate
INFO[0267] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
DEBU[0272] Wrote ping
{"level":"warn","ts":"2024-06-11T15:46:36.173061-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ad4c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
ERRO[0272] Failed to check local etcd status for learner management: context deadline exceeded
INFO[0272] Cluster-Http-Server 2024/06/11 15:46:36 http: TLS handshake error from 127.0.0.1:59026: remote error: tls: bad certificate
INFO[0272] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
INFO[0275] Pod for etcd not synced (pod sandbox has changed), retrying
DEBU[0277] Wrote ping
INFO[0277] Cluster-Http-Server 2024/06/11 15:46:41 http: TLS handshake error from 127.0.0.1:59042: remote error: tls: bad certificate
INFO[0277] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
DEBU[0282] Wrote ping
INFO[0282] Cluster-Http-Server 2024/06/11 15:46:46 http: TLS handshake error from 127.0.0.1:51428: remote error: tls: bad certificate
INFO[0282] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
DEBU[0283] Tunnel server handing HTTP/1.1 CONNECT request for //10.42.1.42:10250 from 127.0.0.1:51450
DEBU[0283] Tunnel server egress proxy dialing 10.42.1.42:10250 directly
INFO[0285] Waiting for API server to become available
INFO[0285] Waiting for etcd server to become available
DEBU[0287] Wrote ping
{"level":"warn","ts":"2024-06-11T15:46:51.173241-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ad4c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
ERRO[0287] Failed to check local etcd status for learner management: context deadline exceeded
INFO[0287] Cluster-Http-Server 2024/06/11 15:46:51 http: TLS handshake error from 127.0.0.1:51458: remote error: tls: bad certificate
INFO[0287] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
{"level":"warn","ts":"2024-06-11T15:46:53.770483-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ad4c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
INFO[0290] Failed to test data store connection: context deadline exceeded
DEBU[0292] Wrote ping
INFO[0292] Cluster-Http-Server 2024/06/11 15:46:56 http: TLS handshake error from 127.0.0.1:34204: remote error: tls: bad certificate
INFO[0292] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
ERRO[0294] Sending HTTP 502 response to 127.0.0.1:60092: dial tcp 10.42.1.42:10250: connect: connection timed out
INFO[0295] Pod for etcd not synced (pod sandbox has changed), retrying

Additional context / logs: The log output above is what I see if I run the following on the one remaining node:

/usr/bin/rke2 server --debug
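Two quick ways to poke at the same failure from the surviving node (a sketch only; the certificate paths below assume the default RKE2 data directory layout and are not taken from this report):

# Follow the supervisor logs without running rke2 in the foreground:
journalctl -u rke2-server -f

# Query the local etcd member's health endpoint directly; while quorum is
# lost, expect the connection to 127.0.0.1:2379 to be refused or unhealthy:
curl --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
     --cert   /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
     --key    /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
     https://127.0.0.1:2379/health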

brandond commented 5 months ago

We are trying to test failover with our cluster to ensure we can run with only one server,

You can't. This is not how etcd or any other distributed datastore works. Your cluster needs to have a quorum of etcd nodes available at all times in order to function.

https://docs.rke2.io/install/ha

An odd number (three recommended) of server nodes that will run etcd, the Kubernetes API, and other control plane services

Why An Odd Number Of Server Nodes?

An etcd cluster must be comprised of an odd number of server nodes for etcd to maintain quorum. For a cluster with n servers, quorum is (n/2)+1. For any odd-sized cluster, adding one node will always increase the number of nodes necessary for quorum. Although adding a node to an odd-sized cluster appears better since there are more machines, the fault tolerance is worse. Exactly the same number of nodes can fail without losing quorum, but there are now more nodes that can fail.
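Concretely, with quorum = floor(n/2) + 1, the failure tolerance works out as follows (this report is the n = 3 row):

n = 1: quorum 1, tolerates 0 failed servers
n = 2: quorum 2, tolerates 0 failed servers
n = 3: quorum 2, tolerates 1 failed server
n = 4: quorum 3, tolerates 1 failed server
n = 5: quorum 3, tolerates 2 failed servers

So a three-server cluster survives losing one server. Losing two drops etcd below quorum, the local etcd member stops serving on 127.0.0.1:2379, and the API server on the surviving node can no longer reach its datastore, which is exactly what the log output above shows.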

See also https://etcd.io/docs/v3.5/faq/#what-is-failure-tolerance
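For completeness (not part of the original exchange): if the two lost servers are genuinely gone and cannot be brought back, the usual way to get the lone survivor serving again is to reset its etcd membership to a single-node cluster rather than waiting for quorum to return. A rough sketch, assuming the standard rke2 --cluster-reset flow:

# On the surviving server, stop the service, reset etcd to a one-member
# cluster, then start the service again and re-join replacement servers:
systemctl stop rke2-server
rke2 server --cluster-reset
systemctl start rke2-server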