rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Failed to get MemberList from server #6872

Closed cloudcafetech closed 1 day ago

cloudcafetech commented 1 day ago

Environmental Info:

rke2 -v
rke2 version v1.30.4+rke2r1 (9517eea519b780e154dd791c555c698e84a0e5cd)
go version go1.22.5 X:boringcrypto

Node(s) CPU architecture, OS, and Version: Linux 5.14.0-427.33.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 16 10:56:24 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

MASTER1:

token: 225mgm-secret
write-kubeconfig-mode: "0644"
cluster-cidr: 10.244.0.0/14
service-cidr: 192.168.0.0/16
node-label:
- "region=master"
tls-san:
  - "172.27.2.209"
  - "172.27.2.219"
  - "172.27.2.223"
  - "172.27.2.225"
# SELINUX
selinux: true

MASTER2:

server: https://172.27.2.209:9345
token: 225mgm-secret
write-kubeconfig-mode: "0644"
cluster-cidr: 10.244.0.0/14
service-cidr: 192.168.0.0/16
node-label:
- "region=master"
tls-san:
  - "172.27.2.209"
  - "172.27.2.219"
  - "172.27.2.223"
  - "172.27.2.225"
# SELINUX
selinux: true

Able to reach Master on 9345

# nc -v 172.27.2.209 9345
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to 172.27.2.209:9345.
^C
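
Port 9345 is only the supervisor port. As far as the RKE2 networking requirements go, a joining server also needs to reach the Kubernetes API (6443) and the etcd client and peer ports (2379, 2380) on the existing server. The following is a minimal sketch of a wider reachability check, not part of the original report; `port_open` and `check_server_ports` are hypothetical helper names, and the sketch assumes `bash` and coreutils `timeout` are installed.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so no extra tools are required.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Hypothetical helper: report the state of the ports a joining server typically needs.
check_server_ports() {
  host="$1"
  for port in 9345 6443 2379 2380; do
    if port_open "$host" "$port"; then
      echo "$port open"
    else
      echo "$port unreachable"
    fi
  done
}

# Usage (IP taken from the config above):
#   check_server_ports 172.27.2.209
```

An unreachable 2379 or 2380 on the existing server would point at etcd not running there or at a firewall rule, which this check alone cannot tell apart.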

Describe the bug:

Not able to join the master; the errors are shown below:

Sep 26 04:59:10 test rke2[2493800]: {"level":"warn","ts":"2024-09-26T04:59:10.746954+0200","logger":"etcd-client","caller":"v3@v3.5.13-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0011981e0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 26 04:59:10 test rke2[2493800]: time="2024-09-26T04:59:10+02:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Sep 26 04:59:11 test rke2[2493800]: time="2024-09-26T04:59:11+02:00" level=info msg="Waiting to retrieve etcd cluster member list: failed to get MemberList from server: Internal error occurred: failed to get etcd MemberList: context deadline exceeded"
Sep 26 04:59:13 test rke2[2493800]: time="2024-09-26T04:59:13+02:00" level=info msg="Waiting to retrieve etcd cluster member list: failed to get MemberList from server: Internal error occurred: failed to get etcd MemberList: context deadline exceeded"
Sep 26 04:59:15 test rke2[2493800]: time="2024-09-26T04:59:15+02:00" level=info msg="Waiting to retrieve etcd cluster member list: failed to get MemberList from server: Internal error occurred: failed to get etcd MemberList: context deadline exceeded"
Sep 26 04:59:16 test rke2[2493800]: {"level":"warn","ts":"2024-09-26T04:59:16.73584+0200","logger":"etcd-client","caller":"v3@v3.5.13-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0011981e0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 26 04:59:16 test rke2[2493800]: time="2024-09-26T04:59:16+02:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 26 04:59:17 test rke2[2493800]: time="2024-09-26T04:59:17+02:00" level=info msg="Waiting to retrieve etcd cluster member list: failed to get MemberList from server: Internal error occurred: failed to get etcd MemberList: context deadline exceeded"

# journalctl -f -u rke2-server | grep "failed to get MemberList"
Sep 26 05:39:10 test rke2[2589133]: time="2024-09-26T05:39:10+02:00" level=info msg="Waiting to retrieve etcd cluster member list: failed to get MemberList from server: Internal error occurred: failed to get etcd MemberList: context deadline exceeded"
Sep 26 05:39:12 test rke2[2589133]: time="2024-09-26T05:39:12+02:00" level=info msg="Waiting to retrieve etcd cluster member list: failed to get MemberList from server: Internal error occurred: failed to get etcd MemberList: context deadline exceeded"
Sep 26 05:39:14 test rke2[2589133]: time="2024-09-26T05:39:14+02:00" level=info msg="Waiting to retrieve etcd cluster member list: failed to get MemberList from server: Internal error occurred: failed to get etcd MemberList: context deadline exceeded"
Sep 26 05:39:16 test rke2[2589133]: time="2024-09-26T05:39:16+02:00" level=info msg="Waiting to retrieve etcd cluster member list: failed to get MemberList from server: Internal error occurred: failed to get etcd MemberList: context deadline exceeded"
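
Filtering on a single message hides the other errors around it. As a rough sketch (not from the original report), the distinct `msg="..."` values in the journal can be counted so the repeated "Waiting to retrieve etcd cluster member list" line does not drown out everything else. `summarize_msgs` is a hypothetical helper name, and it only matches the logrus-style `msg="..."` lines, not the JSON lines emitted by the etcd client.

```shell
# Hypothetical helper: read log lines on stdin and count each distinct msg="..." value,
# most frequent first.
summarize_msgs() {
  grep -o 'msg="[^"]*"' | sort | uniq -c | sort -rn
}

# Usage on the joining node:
#   journalctl -u rke2-server --no-pager --since "1 hour ago" | summarize_msgs
```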

brandond commented 1 day ago

Figure out why the etcd pod isn't running on the existing server node. Have you checked the kubelet.log, and etcd pod logs under /var/log/pods?

Note, RKE2 does not have "master" nodes. Just servers and agents.
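
For reference, a minimal sketch of the log checks suggested above, to be run on the existing server node. It assumes the default RKE2 locations: the kubelet log at /var/lib/rancher/rke2/agent/logs/kubelet.log and pod logs under /var/log/pods, where the etcd static pod directory is expected to be named `kube-system_etcd-<node>_<uid>`. `latest_etcd_log` is a hypothetical helper name, not an RKE2 command.

```shell
# Hypothetical helper: print the most recently modified etcd container log
# under a pod-log root (defaults to /var/log/pods). Prints nothing if none exists.
latest_etcd_log() {
  pod_log_root="${1:-/var/log/pods}"
  # Pod log directories are named <namespace>_<pod>_<uid>/<container>/<n>.log.
  ls -t "$pod_log_root"/kube-system_etcd-*/etcd/*.log 2>/dev/null | head -n 1
}

# Usage (as root on the existing server):
#   tail -n 50 "$(latest_etcd_log)"
#   tail -n 50 /var/lib/rancher/rke2/agent/logs/kubelet.log
```

If the helper prints nothing, the etcd pod has likely never started on that node, and the kubelet log is the next place to look.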

cloudcafetech commented 1 day ago

Looks like the bug/issue was in v1.30.4; after changing to the new version (v1.30.5), the issue is fixed.

By the way, do you have any idea how to find the right stable (bug/issue-free) version?

brandond commented 1 day ago

I'm not aware of any issues with etcd in v1.30.4+rke2r1, so it is unlikely that whatever was going on was resolved by the change in version. Without logs I really can't say though.

In general I'd recommend the latest version available. If we're aware of a bug, we either fix it, or call it out in the release notes.

cloudcafetech commented 1 day ago

Thank you for the reply.

Based on your issue (https://github.com/rancher/rke2/issues/5804), I decided to try another version, and it works.

The biggest challenge is that we are setting up in an air-gapped environment :)

Once again, thanks for the prompt reply.