Closed orarnon closed 5 years ago
We are reaching high Raft commitTime: Raft apply count: RPC count: Runtime heap objects count:
Also: root@consul1:~# consul members | grep left | wc -l 1061
We can provide more metrics
After Upgrading to Consul 1.4.0, we see RPC calls pattern change:
Hey @orarnon what version are you upgrading from? I'd also graph if you can the leadership stats so we can see the leader elections directly.
You may need to tune the raft_multiplier
if leader elections are happening frequently, depending on your environment. I'd take a look at that. Unfortunately without knowing a lot more it is hard to tell what that should be but perhaps you can read a bit there and consider based on your setup.
Hi @pearkes , thanks for answering. We had frequent leader changes at Consul 1.2.3 and we've upgraded to 1.4.0 as a first step. Now, we have modified raft_multiplier to 3 and it stopped. However, we can't find any reason why it would happen in the first place. Our CPU, Memory and network are not saturated by any means.
Now, we have modified raft_multiplier to 3 and it stopped.
Good to hear!
Saturation is part of it -- latency also plays a large role, so if your network topology changed that may affect this as well.
It doesn't sound like there's specifically a bug here, and more that the environment your caused the default value for the raft_multiplier to be incorrect. See the docs there for the intentional trade-off we make in optimizing for what we feel is the common case. I'm going to close this but feel free to open a new issue if you feel there might be something else going on.
Overview of the Issue
Consul cluster performs multiple leader elections in bursts.
Consul info for both Client and Server
Server info
``` agent: check_monitors = 0 check_ttls = 5 checks = 5 services = 6 build: prerelease = revision = 48d287ef version = 1.2.3 consul: bootstrap = false known_datacenters = 0 leader = false leader_addr = 10.200.10.157:8300 server = true raft: applied_index = 458981771 commit_index = 458981771 fsm_pending = 0 last_contact = 12.827875ms last_log_index = 458981771 last_log_term = 81772 last_snapshot_index = 458971913 last_snapshot_term = 81768 latest_configuration = [{Suffrage:Voter ID:673e0ff3-815a-4c56-0b5b-1ccf17d7a7d1 Address:10.200.1.154:8300} {Suffrage:Voter ID:082aa7eb-3692-0d7b-b4fe-7efd1ab5b42e Address:10.200.4.70:8300} {Suffrage:Voter ID:6568e22c-2660-bdb8-8d47-5ffc65c1601d Address:10.200.10.157:8300}] latest_configuration_index = 400654632 num_peers = 2 protocol_version = 3 protocol_version_max = 3 protocol_version_min = 0 snapshot_version_max = 1 snapshot_version_min = 0 state = Follower term = 81772 runtime: arch = amd64 cpu_count = 8 goroutines = 24996 max_procs = 4 os = linux version = go1.10.1 serf_lan: coordinate_resets = 0 encrypted = true event_queue = 0 event_time = 14909 failed = 0 health_score = 0 intent_queue = 0 left = 1057 member_time = 2142504 members = 2401 query_queue = 0 query_time = 245 ```Operating system and Environment details
root@consul1:~# lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 14.04.5 LTS Release: 14.04 Codename: trusty
Log Fragments
Out from consul monitor -log-level=debug | grep -i leader
2018/12/06 13:30:18 [DEBUG] http: Request GET /v1/status/leader (104.366µs) from=127.0.0.1:33502 2018/12/06 13:30:27 [DEBUG] http: Request GET /v1/status/leader (176.165µs) from=127.0.0.1:33542 2018/12/06 13:30:31 [INFO] raft: Node at 10.200.10.157:8300 [Follower] entering Follower state (Leader: "") 2018/12/06 13:30:31 [DEBUG] consul: shutting down leader loop 2018/12/06 13:30:31 [ERR] consul.kvs: Apply failed: leadership lost while committing log 2018/12/06 13:30:31 [ERR] consul: failed to reconcile member: {sspjs-i-081da2aa3e0f81dd3.sonic-us 10.200.1.38 8301 map[build:1.2.3:48d287ef role:node dc:sonic-us_us-east-1 segment: id:e27b279e-9f99-7e90-1921-f1117e0f4a15 vsn:2 vsn_min:2 vsn_max:3] left 1 5 2 2 5 4}: leadership lost while committing log 2018/12/06 13:30:31 [ERR] consul.kvs: Apply failed: leadership lost while committing log 2018/12/06 13:30:31 [ERR] consul.kvs: Apply failed: leadership lost while committing log 2018/12/06 13:30:31 [ERR] consul.kvs: Apply failed: leadership lost while committing log 2018/12/06 13:30:31 [ERR] consul.kvs: Apply failed: leadership lost while committing log 2018/12/06 13:30:31 [INFO] consul: cluster leadership lost 2018/12/06 13:30:32 [INFO] raft: Node at 10.200.10.157:8300 [Leader] entering Leader state 2018/12/06 13:30:32 [INFO] consul: cluster leadership acquired 2018/12/06 13:30:32 [INFO] consul: New leader elected: consul3.sonic-us.us-east-1 2018/12/06 13:30:33 [DEBUG] http: Request GET /v1/status/leader (291.761µs) from=127.0.0.1:33674 2018/12/06 13:30:33 [INFO] raft: Node at 10.200.10.157:8300 [Follower] entering Follower state (Leader: "") 2018/12/06 13:30:33 [DEBUG] consul: shutting down leader loop 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [ERR] consul.kvs: Apply failed: node is not the leader 2018/12/06 13:30:33 [INFO] consul: cluster leadership lost 2018/12/06 13:30:33 [WARN] consul.coordinate: Batch update failed: node is not the leader 2018/12/06 13:30:33 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:33 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:33 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:33 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:33 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:34 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:34 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:34 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:34 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:34 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:34 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:34 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:34 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:34 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:34 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:35 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:35 [INFO] consul: New leader elected: consul1.sonic-us.us-east-1 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:36 [DEBUG] serf: messageUserEventType: consul:new-leader 2018/12/06 13:30:42 [DEBUG] http: Request GET /v1/status/leader (122.805µs) from=127.0.0.1:33708 2018/12/06 13:30:42 [DEBUG] http: Request GET /v1/status/leader (98.895µs) from=127.0.0.1:33712 2018/12/06 13:30:57 [DEBUG] http: Request GET /v1/status/leader (151.047µs) from=127.0.0.1:33752 2018/12/06 13:30:57 [DEBUG] http: Request GET /v1/status/leader (143.795µs) from=127.0.0.1:33756 2018/12/06 13:31:01 [DEBUG] http: Request GET /v1/status/leader?stale (34.443µs) from=127.0.0.1:33776