hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Joining a Consul server node to a 5 node cluster causes periodic loss of leader #11355

Open alainnonga opened 2 years ago

alainnonga commented 2 years ago


Overview of the Issue

Adding a Consul server node back to a 5-node cluster (4 Consul servers, 1 Consul client) causes periodic loss of the leader. Sometimes restarting the Consul agent resolves the issue; sometimes it does not, and the Consul agents on all nodes have to be restarted.

Reproduction Steps

We have seen this issue from time to time in a production environment, but have not been able to reproduce it.

  1. Create a cluster with 5 server nodes and 1 client node, with the autopilot `CleanupDeadServers` option set to false
  2. Remove 1 server node from the cluster with `consul leave` for maintenance
  3. About 12 hours later, add the node back to the cluster with `consul join` (a rough command sketch follows below)
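For reference, a rough shell sketch of these steps, assuming the agents are already installed and configured; the hostname `HostD.dnA.net` is an illustrative placeholder built from the anonymized names in the logs, and the autopilot flag shown is the CLI equivalent of `CleanupDeadServers = false`:

```shell
# One-time: keep dead servers in the Raft configuration
# (CLI equivalent of autopilot CleanupDeadServers = false).
# With ACLs enabled, export CONSUL_HTTP_TOKEN first.
consul operator autopilot set-config -cleanup-dead-servers=false

# Step 2: on the server being taken out for maintenance (HostF in the logs),
# gracefully leave the cluster.
consul leave

# Step 3: ~12 hours later, on the same node, rejoin via any existing member.
consul join HostD.dnA.net

# Verify membership and the Raft voter set afterwards.
consul members
consul operator raft list-peers
```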

Consul info for both Client and Server

Client info

```
agent:
    check_monitors = 16
    check_ttls = 0
    checks = 16
    services = 16
build:
    prerelease =
    revision = a82e6a7f
    version = 1.5.2
consul:
    acl = enabled
    known_servers = 5
    server = false
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 63
    max_procs = 2
    os = linux
    version = 1.12
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 4
    members = 6
    query_queue = 0
    query_time = 1
```

Server info

```
agent:
    check_monitors = 16
    check_ttls = 0
    checks = 16
    services = 16
build:
    prerelease =
    revision = a82e6a7f
    version = 1.5.2
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = ipA:8300
    server = true
raft:
    applied_index = 8401
    commit_index = 8401
    fsm_pending = 0
    last_contact = 47.060828ms
    last_log_index = 8401
    last_log_term = 2
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidC Address:ipD:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidF Address:ipF:8300}]
    latest_configuration_index = 83
    num_peers = 4
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 2
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 104
    max_procs = 2
    os = linux
    version = 1.12
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 4
    members = 6
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 8
    members = 5
    query_queue = 0
    query_time = 1
```
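As a side note, a hedged sketch of commands for cross-checking the Raft and autopilot state reflected in the `consul info` output above, assuming the agent's HTTP API is on the default `127.0.0.1:8500` and that a suitable ACL token is set (ACLs are enabled here):

```shell
# Raft peers (voters/non-voters) as the servers currently see them.
consul operator raft list-peers

# Autopilot configuration, including CleanupDeadServers.
consul operator autopilot get-config

# Autopilot's per-server health view over the HTTP API.
curl -sS http://127.0.0.1:8500/v1/operator/autopilot/health
```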

Operating system and Environment details

CentOS 7, amd64

Log Fragments

From HostF, the node added back to the cluster:

```
2021/10/06 03:08:52 [INFO] agent: (LAN) joined: 2
2021/10/06 03:08:52 [INFO] agent: Join LAN completed. Synced with 2 initial agents
2021/10/06 03:08:52 [INFO] agent: (WAN) joined: 2
2021/10/06 03:08:52 [INFO] agent: Join WAN completed. Synced with 2 initial agents
2021/10/06 03:08:52 [INFO] consul: Existing Raft peers reported by HostD, disabling bootstrap mode
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostD (Addr: tcp/ipD:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostC (Addr: tcp/ipC:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostA (Addr: tcp/ipA:8300) (DC: dcAA)
...
2021/10/06 03:10:27 [WARN] raft: Election timeout reached, restarting election
2021/10/06 03:10:27 [INFO] raft: Node at ipF:8300 [Candidate] entering Candidate state in term 527
2021/10/06 03:10:33 [ERR] agent: Coordinate update error: No cluster leader
...
2021/10/06 03:11:01 [INFO] consul: New leader elected: HostA
...
2021/10/06 03:13:07 [INFO] raft: Node at ipF:8300 [Candidate] entering Candidate state in term 533
2021/10/06 03:13:14 [ERR] http: Request GET /v1/kv/cluster_public_addr, error: No cluster leader from=@
2021/10/06 03:13:15 [ERR] http: Request GET /v1/kv/cluster_health_data, error: No cluster leader from=@
```
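For what it's worth, the failing requests at the end of this fragment can be reproduced against the agent's HTTP API while no leader is elected; a hedged sketch, assuming the default HTTP address and reusing the key names from the log:

```shell
# Default-consistency KV reads go through the Raft leader, so they fail
# with "No cluster leader" (HTTP 500) while no leader is elected.
curl -sS http://127.0.0.1:8500/v1/kv/cluster_public_addr
curl -sS http://127.0.0.1:8500/v1/kv/cluster_health_data

# A stale read can be answered by any server from its local data,
# even without a leader.
curl -sS "http://127.0.0.1:8500/v1/kv/cluster_public_addr?stale"
```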

From HostA, the cluster leader:

```
2021/10/06 03:08:52 [INFO] serf: EventMemberJoin: HostF.dnA.net ipF
2021/10/06 03:08:52 [INFO] consul: Adding LAN server HostF.dnA.net (Addr: tcp/ipF:8300) (DC: dcAA)
2021/10/06 03:08:52 [INFO] raft: Updating configuration with AddNonvoter (uuidF, ipF:8300) to [{Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300} {Suffrage:Nonvoter ID:uuidF Address:ipF:8300}]
...
2021/10/06 03:08:55 [INFO] raft: Updating configuration with RemoveServer (uuidF, ) to [{Suffrage:Voter ID:uuidC Address:ipC:8300} {Suffrage:Voter ID:uuidA Address:ipA:8300} {Suffrage:Voter ID:uuidD Address:ipD:8300} {Suffrage:Voter ID:uuidE Address:ipE:8300}]
2021/10/06 03:08:55 [INFO] raft: Removed peer uuidF, stopping replication after 105626800
...
2021/10/06 03:10:27 [WARN] raft: Rejecting vote request from ipF:8300 since we have a leader: ipA:8300
...
2021/10/06 03:10:55 [ERR] consul: failed to reconcile member: {HostF.dnA.net ipF 8301 map[acls:1 build:1.5.2:a82e6a7f dc:dcAA expect:3 id:uuidF port:8300 raft_vsn:3 role:consul segment: use_tls:1 vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}: leadership lost while committing log
2021/10/06 03:10:55 [INFO] raft: aborting pipeline replication to peer {Voter uuidC ipC:8300}
2021/10/06 03:10:55 [INFO] consul: removing server by ID: "uuidF"
...
2021/10/06 03:10:55 [INFO] consul: cluster leadership lost
2021/10/06 03:10:57 [WARN] raft: Rejecting vote request from ipF:8300 since our last index is greater (105627279, 105627019)

2021/10/06 03:11:01 [WARN] raft: Heartbeat timeout from "" reached, starting election
2021/10/06 03:11:01 [INFO] raft: Node at ipA:8300 [Candidate] entering Candidate state in term 532
2021/10/06 03:11:01 [INFO] raft: Election won. Tally: 3
...
... vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}: leadership lost while committing log
2021/10/06 03:14:01 [INFO] raft: aborting pipeline replication to peer {Voter uuidE ipE:8300}
2021/10/06 03:14:01 [INFO] consul: cluster leadership lost
```
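As an aside, a small hedged sketch of how the leadership flapping can be watched from any node while this is happening (default HTTP address assumed; the status endpoints answer even when there is no leader):

```shell
# Current Raft leader as "ip:8300", or an empty string when there is none.
curl -sS http://127.0.0.1:8500/v1/status/leader

# Raft peer set known to the contacted server.
curl -sS http://127.0.0.1:8500/v1/status/peers

# Poll the leader endpoint to see how often leadership changes or drops.
while true; do
  date
  curl -sS http://127.0.0.1:8500/v1/status/leader
  echo
  sleep 5
done
```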

blake commented 2 years ago

Hi @alainnonga, thanks for the detailed info. This issue sounds similar to the bug initially reported in #9755 (further detail in #10970).

Assuming it's the same issue, several fixes have already been merged and will be available in the next patch releases for the currently supported versions of Consul (1.8.x - 1.10.x).
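Worth noting that the `consul info` output above reports version 1.5.2, which is outside the 1.8.x - 1.10.x range mentioned; a quick hedged sketch for confirming what each agent actually runs before picking up the patched releases:

```shell
# Version of the local consul binary.
consul version

# The Build column shows the version each cluster member is running.
consul members
```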