hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul Leadership Lost Storm (Consul 1.4.0) #5064

Closed · orarnon closed this issue 5 years ago

orarnon commented 5 years ago

Overview of the Issue

Consul cluster performs multiple leader elections in bursts.

Consul info for both Client and Server

Server info:

```
agent:
    check_monitors = 0
    check_ttls = 5
    checks = 5
    services = 6
build:
    prerelease =
    revision = 48d287ef
    version = 1.2.3
consul:
    bootstrap = false
    known_datacenters = 0
    leader = false
    leader_addr = 10.200.10.157:8300
    server = true
raft:
    applied_index = 458981771
    commit_index = 458981771
    fsm_pending = 0
    last_contact = 12.827875ms
    last_log_index = 458981771
    last_log_term = 81772
    last_snapshot_index = 458971913
    last_snapshot_term = 81768
    latest_configuration = [{Suffrage:Voter ID:673e0ff3-815a-4c56-0b5b-1ccf17d7a7d1 Address:10.200.1.154:8300} {Suffrage:Voter ID:082aa7eb-3692-0d7b-b4fe-7efd1ab5b42e Address:10.200.4.70:8300} {Suffrage:Voter ID:6568e22c-2660-bdb8-8d47-5ffc65c1601d Address:10.200.10.157:8300}]
    latest_configuration_index = 400654632
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 81772
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 24996
    max_procs = 4
    os = linux
    version = go1.10.1
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 14909
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 1057
    member_time = 2142504
    members = 2401
    query_queue = 0
    query_time = 245
```

Operating system and Environment details

```
root@consul1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty
```

Log Fragments

Output from `consul monitor -log-level=debug | grep -i leader`:

```
2018/12/06 13:30:18 [DEBUG] http: Request GET /v1/status/leader (104.366µs) from=127.0.0.1:33502
2018/12/06 13:30:27 [DEBUG] http: Request GET /v1/status/leader (176.165µs) from=127.0.0.1:33542
2018/12/06 13:30:31 [INFO] raft: Node at 10.200.10.157:8300 [Follower] entering Follower state (Leader: "")
2018/12/06 13:30:31 [DEBUG] consul: shutting down leader loop
2018/12/06 13:30:31 [ERR] consul.kvs: Apply failed: leadership lost while committing log
2018/12/06 13:30:31 [ERR] consul: failed to reconcile member: {sspjs-i-081da2aa3e0f81dd3.sonic-us 10.200.1.38 8301 map[build:1.2.3:48d287ef role:node dc:sonic-us_us-east-1 segment: id:e27b279e-9f99-7e90-1921-f1117e0f4a15 vsn:2 vsn_min:2 vsn_max:3] left 1 5 2 2 5 4}: leadership lost while committing log
2018/12/06 13:30:31 [ERR] consul.kvs: Apply failed: leadership lost while committing log
2018/12/06 13:30:31 [ERR] consul.kvs: Apply failed: leadership lost while committing log
2018/12/06 13:30:31 [ERR] consul.kvs: Apply failed: leadership lost while committing log
2018/12/06 13:30:31 [ERR] consul.kvs: Apply failed: leadership lost while committing log
2018/12/06 13:30:31 [INFO] consul: cluster leadership lost
2018/12/06 13:30:32 [INFO] raft: Node at 10.200.10.157:8300 [Leader] entering Leader state
2018/12/06 13:30:32 [INFO] consul: cluster leadership acquired
2018/12/06 13:30:32 [INFO] consul: New leader elected: consul3.sonic-us.us-east-1
2018/12/06 13:30:33 [DEBUG] http: Request GET /v1/status/leader (291.761µs) from=127.0.0.1:33674
2018/12/06 13:30:33 [INFO] raft: Node at 10.200.10.157:8300 [Follower] entering Follower state (Leader: "")
2018/12/06 13:30:33 [DEBUG] consul: shutting down leader loop
2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader
    [... the same "node is not the leader" apply error repeats 11 more times at 13:30:33 ...]
2018/12/06 13:30:33 [INFO] consul: cluster leadership lost
2018/12/06 13:30:33 [WARN] consul.coordinate: Batch update failed: node is not the leader
2018/12/06 13:30:33 [DEBUG] serf: messageUserEventType: consul:new-leader
    [... the same serf event repeats ~15 more times between 13:30:33 and 13:30:35 ...]
2018/12/06 13:30:35 [INFO] consul: New leader elected: consul1.sonic-us.us-east-1
2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader
    [... the same serf event repeats ~18 more times at 13:30:36 ...]
2018/12/06 13:30:42 [DEBUG] http: Request GET /v1/status/leader (122.805µs) from=127.0.0.1:33708
2018/12/06 13:30:42 [DEBUG] http: Request GET /v1/status/leader (98.895µs) from=127.0.0.1:33712
2018/12/06 13:30:57 [DEBUG] http: Request GET /v1/status/leader (151.047µs) from=127.0.0.1:33752
2018/12/06 13:30:57 [DEBUG] http: Request GET /v1/status/leader (143.795µs) from=127.0.0.1:33756
2018/12/06 13:31:01 [DEBUG] http: Request GET /v1/status/leader?stale (34.443µs) from=127.0.0.1:33776
```
orarnon commented 5 years ago

We are reaching a high Raft commitTime. [Screenshots in the original issue show graphs of Raft commitTime, Raft apply count, RPC count, and runtime heap objects count.]

Also:

```
root@consul1:~# consul members | grep left | wc -l
1061
```

We can provide more metrics.

orarnon commented 5 years ago

After upgrading to Consul 1.4.0, we see the RPC call pattern change. [Screenshot in the original issue.]

pearkes commented 5 years ago

Hey @orarnon, what version are you upgrading from? If you can, I'd also graph the leadership stats so we can see the leader elections directly.
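If it's useful, the leadership-related Raft metrics (e.g. `consul.raft.state.leader` and `consul.raft.state.candidate` count state transitions) can be pulled straight from an agent. A minimal sketch, assuming the HTTP API is on the default `127.0.0.1:8500`:

```sh
# List the raft-related metric names a local agent is currently reporting;
# /v1/agent/metrics returns a JSON snapshot of the in-memory telemetry.
curl -s http://127.0.0.1:8500/v1/agent/metrics \
  | grep -o '"consul\.raft\.[^"]*"' \
  | sort -u
```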

You may need to tune the raft_multiplier if leader elections are happening frequently, depending on your environment, so I'd take a look at that. Unfortunately, without knowing a lot more it's hard to tell what that value should be, but the docs there should help you decide based on your setup.
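For reference, `raft_multiplier` lives under the `performance` stanza of the server config; it's an integer from 1 to 10 that scales the default Raft timing (higher values tolerate more latency but detect failures more slowly). A minimal sketch, assuming a JSON config directory at `/etc/consul.d` and a service named `consul` (both assumptions about your setup):

```sh
# Illustrative only: drop a performance stanza into the server config dir.
cat <<'EOF' | sudo tee /etc/consul.d/performance.json
{
  "performance": {
    "raft_multiplier": 3
  }
}
EOF
# Raft timing is not hot-reloadable, so restart the servers one at a time.
sudo service consul restart
```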

orarnon commented 5 years ago

Hi @pearkes, thanks for answering. We had frequent leader changes on Consul 1.2.3, and we upgraded to 1.4.0 as a first step. Now we have modified raft_multiplier to 3 and the elections stopped. However, we can't find any reason why this would happen in the first place; our CPU, memory, and network are not saturated by any means.

pearkes commented 5 years ago

> Now we have modified raft_multiplier to 3 and the elections stopped.

Good to hear!

Saturation is only part of it; latency also plays a large role, so if your network topology changed, that may affect this as well.
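One cheap sanity check here is Consul's own round-trip estimate derived from the Serf network coordinates. The node names below are taken from the logs earlier in this thread; substitute real names from `consul members`:

```sh
# Estimate the network round-trip time between two members using their
# network (Serf) coordinates.
consul rtt consul1.sonic-us.us-east-1 consul3.sonic-us.us-east-1
```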

It doesn't sound like there's a bug here so much as an environment where the default value for raft_multiplier is incorrect. See the docs there for the intentional trade-off we make in optimizing for what we feel is the common case. I'm going to close this, but feel free to open a new issue if you think there might be something else going on.