alainnonga opened 2 years ago
Hi @alainnonga, thanks for the detailed info. This issue sounds similar to the bug initially reported in #9755 (further detail in #10970).
Assuming it's the same issue, several fixes have already been merged and will be available in the next patch releases for the currently supported versions of Consul (1.8.x - 1.10.x).
#### Overview of the Issue
Adding a Consul server node to a 5-node cluster (4 Consul servers, 1 Consul client) causes periodic loss of the leader. Sometimes restarting the Consul agent resolves the issue; sometimes it does not, and you have to restart the Consul agents on all nodes.
#### Reproduction Steps
We have seen this issue from time to time in our production environment, but we have not been able to reproduce it.
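When the issue recurs, one thing worth capturing is whether the newly added server is stuck as a non-voter in the Raft peer set. On a live cluster that is `consul operator raft list-peers`; the sketch below parses a captured sample of that output (the sample rows are fabricated for illustration, modeled on the hosts in this report) and counts the voters, which should equal the number of servers once promotion completes.

```shell
# Hypothetical diagnostic: count Raft voters from a captured
# `consul operator raft list-peers` output. The sample below is
# illustrative, not real output from the affected cluster.
sample='Node   ID     Address      State     Voter
HostA  uuidA  ipA:8300     leader    true
HostC  uuidC  ipC:8300     follower  true
HostD  uuidD  ipD:8300     follower  true
HostF  uuidF  ipF:8300     follower  false'

# Column 5 is the Voter flag; skip the header row.
voters=$(printf '%s\n' "$sample" | awk 'NR>1 && $5=="true"{n++} END{print n+0}')
echo "voters=$voters"
```

If `voters` stays below the server count long after the join, the new node never finished promotion, which would match the election churn in the logs below.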
#### Consul info for both Client and Server
Client info

```
agent:
	check_monitors = 16
	check_ttls = 0
	checks = 16
	services = 16
build:
	prerelease =
	revision = a82e6a7f
	version = 1.5.2
consul:
	acl = enabled
	known_servers = 5
	server = false
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 63
	max_procs = 2
	os = linux
	version = 1.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 6
	query_queue = 0
	query_time = 1
```

Server info
```
agent:
	check_monitors = 16
	check_ttls = 0
	checks = 16
	services = 16
build:
	prerelease =
	revision = a82e6a7f
	version = 1.5.2
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = ipA:8300
	server = true
raft:
	applied_index = 8401
	commit_index = 8401
	fsm_pending = 0
	last_contact = 47.060828ms
	last_log_index = 8401
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidC Address:ipD:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidF Address:ipF:8300}]
	latest_configuration_index = 83
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 2
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 104
	max_procs = 2
	os = linux
	version = 1.12
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 6
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 8
	members = 5
	query_queue = 0
	query_time = 1
```

#### Operating system and Environment details
CentOS 7, amd64
#### Log Fragments
From HostF, the server added to the cluster:

```
2021/10/06 03:08:52 [INFO] agent: (LAN) joined: 2
2021/10/06 03:08:52 [INFO] agent: Join LAN completed. Synced with 2 initial agents
2021/10/06 03:08:52 [INFO] agent: (WAN) joined: 2
2021/10/06 03:08:52 [INFO] agent: Join WAN completed. Synced with 2 initial agents
2021/10/06 03:08:52 [INFO] consul: Existing Raft peers reported by HostD, disabling bootstrap mode
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostD (Addr: tcp/ipD:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostC (Addr: tcp/ipC:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostA (Addr: tcp/ipA:8300) (DC: dcAA)
...
2021/10/06 03:10:27 [WARN] raft: Election timeout reached, restarting election
2021/10/06 03:10:27 [INFO] raft: Node at ipF:8300 [Candidate] entering Candidate state in term 527
2021/10/06 03:10:33 [ERR] agent: Coordinate update error: No cluster leader
...
2021/10/06 03:11:01 [INFO] consul: New leader elected: HostA
...
2021/10/06 03:13:07 [INFO] raft: Node at ipF:8300 [Candidate] entering Candidate state in term 533
2021/10/06 03:13:14 [ERR] http: Request GET /v1/kv/cluster_public_addr, error: No cluster leader from=@
2021/10/06 03:13:15 [ERR] http: Request GET /v1/kv/cluster_health_data, error: No cluster leader from=@
```
From HostA, the cluster leader:

```
2021/10/06 03:08:52 [INFO] serf: EventMemberJoin: HostF.dnA.net ipF
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostF.dnA.net (Addr: tcp/ipF:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] raft: Updating configuration with AddNonvoter (uuidF, ipF:8300) to [{Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300} {Suffrage:Nonvoter ID:uuidF Address:ipF:8300}]
...
2021/10/06 03:08:55 [INFO] raft: Updating configuration with RemoveServer (uuidF, ) to [{Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300}]
2021/10/06 03:08:55 [INFO] raft: Removed peer uuidF, stopping replication after 105626800
...
2021/10/06 03:10:27 [WARN] raft: Rejecting vote request from ipF:8300 since we have a leader: ipA:8300
...
2021/10/06 03:10:55 [ERR] consul: failed to reconcile member: {HostF.dnA.net ipF 8301 map[acls:1 build:1.5.2:a82e6a7f dc:dcAA expect:3 id:uuidF port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}: leadership lost while committing log
2021/10/06 03:10:55 [INFO] raft: aborting pipeline replication to peer {Voter uuidC ipC:8300}
2021/10/06 03:10:55 [INFO] consul: removing server by ID: "uuidF"
...
2021/10/06 03:10:55 [INFO] consul: cluster leadership lost
2021/10/06 03:10:57 [WARN] raft: Rejecting vote request from ipF:8300 since our last index is greater (105627279, 105627019)
2021/10/06 03:11:01 [WARN] raft: Heartbeat timeout from "" reached, starting election
2021/10/06 03:11:01 [INFO] raft: Node at ipA:8300 [Candidate] entering Candidate state in term 532
2021/10/06 03:11:01 [INFO] raft: Election won. Tally: 3
...
vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}: leadership lost while committing log
2021/10/06 03:14:01 [INFO] raft: aborting pipeline replication to peer {Voter uuidE ipE:8300}
2021/10/06 03:14:01 [INFO] consul: cluster leadership lost
```
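Since the flapping repeats over a window of several minutes, a rough way to quantify it is to count leadership transitions in the agent log. A minimal sketch, run here against a few sample lines modeled on the fragments above (on a real host you would feed it the actual agent log file):

```shell
# Hypothetical diagnostic: count leadership-loss and election events.
# The sample log lines are illustrative, taken from the shape of the
# fragments in this report.
log='2021/10/06 03:10:55 [INFO] consul: cluster leadership lost
2021/10/06 03:11:01 [INFO] consul: New leader elected: HostA
2021/10/06 03:14:01 [INFO] consul: cluster leadership lost'

lost=$(printf '%s\n' "$log" | grep -c 'cluster leadership lost')
elected=$(printf '%s\n' "$log" | grep -c 'New leader elected')
echo "lost=$lost elected=$elected"
```

A `lost` count that keeps growing while `elected` lags behind matches the pattern above, where HostF's repeated vote requests keep disrupting an otherwise healthy leader.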