hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

consul server becomes a nonvoter and fails to elect a leader #20045

Open chenxing0407 opened 8 months ago

chenxing0407 commented 8 months ago

We have three servers: node1, node2, and node3. The version is:

[cloud@node2 ~]$ consul version
Consul v1.9.2
Revision 6530cf370
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

The config files look like the following; the only differences between the nodes are the IP address and the node name.

{
  "addresses": { "http": "10.10.10.11" },
  "autopilot": { "cleanup_dead_servers": false },
  "bind_addr": "10.10.10.11",
  "bootstrap_expect": 3,
  "data_dir": "/tmp/consul_storage",
  "datacenter": "storage",
  "enable_script_checks": true,
  "log_level": "INFO",
  "node_name": "node1",
  "ports": { "dns": -1 },
  "reconnect_timeout": "8760h",
  "retry_join": [ "10.10.10.14", "10.10.10.17" ],
  "server": true
}

On boot, the cluster fails to elect a leader. This is the first time I have seen this.

Using consul monitor --log-level=debug -http-addr=10.10.10.17:8500, the log keeps producing output like this:

2023-12-22T15:49:38.878+0800 [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader=
2023-12-22T15:49:38.878+0800 [INFO] agent.server.raft: entering candidate state: node="Node at 10.10.10.17:8300 [Candidate]" term=47991
2023-12-22T15:49:38.879+0800 [DEBUG] agent.server.raft: votes: needed=2
2023-12-22T15:49:38.879+0800 [WARN] agent.server.raft: unable to get address for server, using fallback address: id=19a68ea7-ac67-a776-3c0f-c6197cbfba7f fallback=10.10.10.17:8300 error="Could not find address for server id 19a68ea7-ac67-a776-3c0f-c6197cbfba7f"
2023-12-22T15:49:38.879+0800 [WARN] agent.server.raft: unable to get address for server, using fallback address: id=54c79a65-1d3c-ad25-ed60-0e200bf0d283 fallback=10.10.10.14:8300 error="Could not find address for server id 54c79a65-1d3c-ad25-ed60-0e200bf0d283"
2023-12-22T15:49:38.881+0800 [DEBUG] agent.server.raft: vote granted: from=19a68ea7-ac67-a776-3c0f-c6197cbfba7f term=47991 tally=1
2023-12-22T15:49:38.882+0800 [DEBUG] agent.server.raft: vote granted: from=54c79a65-1d3c-ad25-ed60-0e200bf0d283 term=47991 tally=2
2023-12-22T15:49:38.882+0800 [INFO] agent.server.raft: election won: tally=2
2023-12-22T15:49:38.882+0800 [INFO] agent.server.raft: entering leader state: leader="Node at 10.10.10.17:8300 [Leader]"
2023-12-22T15:49:38.882+0800 [INFO] agent.server.raft: added peer, starting replication: peer=19a68ea7-ac67-a776-3c0f-c6197cbfba7f
2023-12-22T15:49:38.882+0800 [INFO] agent.server.raft: added peer, starting replication: peer=54c79a65-1d3c-ad25-ed60-0e200bf0d283
2023-12-22T15:49:38.882+0800 [INFO] agent.server.raft: added peer, starting replication: peer=e4d66e88-1359-fa07-672b-a75211472427
2023-12-22T15:49:38.882+0800 [INFO] agent.server: New leader elected: payload=node2
2023-12-22T15:49:38.882+0800 [WARN] agent.server.raft: unable to get address for server, using fallback address: id=19a68ea7-ac67-a776-3c0f-c6197cbfba7f fallback=10.10.10.17:8300 error="Could not find address for server id 19a68ea7-ac67-a776-3c0f-c6197cbfba7f"
2023-12-22T15:49:38.882+0800 [INFO] agent.server: cluster leadership acquired
2023-12-22T15:49:38.882+0800 [WARN] agent.server.raft: unable to get address for server, using fallback address: id=54c79a65-1d3c-ad25-ed60-0e200bf0d283 fallback=10.10.10.14:8300 error="Could not find address for server id 54c79a65-1d3c-ad25-ed60-0e200bf0d283"
2023-12-22T15:49:38.883+0800 [WARN] agent.server.raft: unable to get address for server, using fallback address: id=54c79a65-1d3c-ad25-ed60-0e200bf0d283 fallback=10.10.10.14:8300 error="Could not find address for server id 54c79a65-1d3c-ad25-ed60-0e200bf0d283"
2023-12-22T15:49:38.883+0800 [INFO] agent.server.raft: pipelining replication: peer="{Voter 54c79a65-1d3c-ad25-ed60-0e200bf0d283 10.10.10.14:8300}"
2023-12-22T15:49:38.884+0800 [INFO] agent.server.raft: pipelining replication: peer="{Nonvoter e4d66e88-1359-fa07-672b-a75211472427 10.10.10.11:8300}"
2023-12-22T15:49:38.885+0800 [INFO] agent.server.raft: entering follower state: follower="Node at 10.10.10.17:8300 [Follower]" leader=
2023-12-22T15:49:38.885+0800 [DEBUG] agent.server: shutting down leader loop
2023-12-22T15:49:38.885+0800 [INFO] agent.server.raft: aborting pipeline replication: peer="{Voter 54c79a65-1d3c-ad25-ed60-0e200bf0d283 10.10.10.14:8300}"
2023-12-22T15:49:38.885+0800 [INFO] agent.server.raft: aborting pipeline replication: peer="{Nonvoter e4d66e88-1359-fa07-672b-a75211472427 10.10.10.11:8300}"
2023-12-22T15:49:38.885+0800 [ERROR] agent.server: failed to wait for barrier: error="leadership lost while committing log"
2023-12-22T15:49:38.885+0800 [INFO] agent.server: cluster leadership lost
2023-12-22T15:49:39.039+0800 [DEBUG] agent.server.serf.lan: serf: messageUserEventType: consul:new-leader
2023-12-22T15:49:39.092+0800 [DEBUG] agent.server.serf.lan: serf: messageUserEventType: consul:new-leader
2023-12-22T15:49:39.103+0800 [DEBUG] agent.server.serf.lan: serf: messageUserEventType: consul:new-leader
2023-12-22T15:49:39.677+0800 [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=10.10.10.17:32892 latency=38.52µs
2023-12-22T15:49:40.106+0800 [WARN] agent: Syncing node info failed.: error="No cluster leader"

My question is: why did 10.10.10.11 become a Nonvoter? Is that related to the failure to elect a leader?
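For context on the Nonvoter state: voter status in Consul is managed by Autopilot, the same subsystem the cleanup_dead_servers setting above belongs to. With raft protocol 3, the leader adds new or rejoining servers as nonvoters and only promotes them to voters after they have been stable for ServerStabilizationTime, so a server can stay a nonvoter if no leader survives long enough to promote it. The current settings can be dumped with the command below (the values shown are the 1.9.x defaults plus the cleanup_dead_servers override from the config, not output captured from this cluster):

$ consul operator autopilot get-config -http-addr=10.10.10.17:8500
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
MinQuorum = 0
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""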

chenxing0407 commented 8 months ago

When the servers first start, I see logs like:

2023-12-25T16:11:58.852+0800 [ERROR] agent.server.raft: failed to decode incoming command: error="read tcp 10.10.10.11:8300->10.10.10.14:38071: read: connection reset by peer"
2023-12-25T16:11:58.852+0800 [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error making call: EOF"

This may be related to network traffic: atop shows dropped incoming packets (drpi 5).
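For reference, a couple of ways to watch those drops and the server RPC path from the shell. In atop, drpi/drpo on the NET lines are dropped incoming/outgoing packets per interval; ip -s link reads the same kernel counters. The interface name eth0 is a placeholder:

# atop refreshing every 5 seconds; check the NET lines for drpi/drpo
atop 5

# cumulative RX/TX error and drop counters for one interface
ip -s link show dev eth0

# quick reachability check of the Consul server RPC port between hosts
nc -zv 10.10.10.14 8300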