hashicorp / consul


Consul repeated leader elections #3967

Closed dpgaspar closed 6 years ago

dpgaspar commented 6 years ago

Description of the Issue (and unexpected/desired result)

I'm running a 3-node Consul cluster (version 0.9.3) on AWS, with roughly 100 client nodes using consul-template.

The client counts fluctuate, but currently stand at 104 alive and 86 failed.
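
For reference, the failed members can be enumerated from any agent; a minimal sketch (assumes a local agent with the default setup):

    # List members whose serf status is "failed"
    consul members -status=failed
    # Count them, skipping the header line
    consul members -status=failed | tail -n +2 | wc -l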

I suspect that when a new deploy adds, say, 5 to 10 new Consul client nodes, and the rest of the clients can't gossip with them (e.g. due to a firewall misconfiguration), the Consul cluster goes into leader election roughly once a day.

If all nodes can gossip freely with each other, the cluster runs smoothly, although I still detect leader elections roughly every 15 days.

Of course, this is a suspicion, not a fact.
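
If a firewall misconfiguration is the cause, serf LAN gossip on port 8301 (both TCP and UDP) between the new clients and the rest of the fleet is the first thing to verify; a sketch with netcat, where the server address is a placeholder:

    # Serf LAN gossip defaults to port 8301 over both TCP and UDP
    nc -vz  10.0.0.1 8301    # TCP reachability
    nc -vzu 10.0.0.1 8301    # UDP (note: lack of a reply does not always mean blocked)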

The Consul servers run with raft_multiplier = 1 (best performance).
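
For context, a sketch of how that setting is typically applied as a server config fragment (the file path and JSON layout are assumptions about this deployment):

    # Write a config fragment setting raft_multiplier on each server
    cat <<'EOF' | sudo tee /etc/consul.d/performance.json
    {
      "performance": {
        "raft_multiplier": 1
      }
    }
    EOF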

Performance metrics:

Consul:
    RPC queries: ~16/s
    goroutines: 1.5K to 2.5K

System:
    network recv on the leader: ~65 KB/s
    CPU: consistently under 3-4%
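
These figures presumably come from Consul's telemetry. For a quick spot check without a metrics pipeline, an agent dumps its in-memory metrics when sent SIGUSR1 (the go-metrics default Consul uses); a sketch:

    # Dump the agent's in-memory telemetry (rates, counters) to its log output
    kill -USR1 $(pidof consul)
    # The goroutine count is also visible under runtime in `consul info`
    consul info | grep goroutines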

Reproduction steps

consul version for both Client and Server

Client: [client version here]
Server: [server version here]

consul info for both Client and Server

Client:

[Client `consul info` here]

Server:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 112c060
    version = 0.9.3
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = XX.XX.XX.179:8300
    server = true
raft:
    applied_index = 3969323
    commit_index = 3969323
    fsm_pending = 0
    last_contact = 0
    last_log_index = 3969323
    last_log_term = 546
    last_snapshot_index = 3961156
    last_snapshot_term = 546
    latest_configuration = [{Suffrage:Voter ID:XX.XX.XX.57:8300 Address:XX.XX.XX.57:8300} {Suffrage:Voter ID:XX.XX.XX.179:8300 Address:XX.XX.XX.179:8300} {Suffrage:Voter ID:XX.XX.XX.68:8300 Address:XX.XX.XX.68:8300}]
    latest_configuration_index = 2844431
    num_peers = 2
    protocol_version = 2
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 546
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 1968
    max_procs = 2
    os = linux
    version = go1.9
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 49
    failed = 83
    health_score = 0
    intent_queue = 0
    left = 1
    member_time = 6732
    members = 194
    query_queue = 0
    query_time = 277
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 204
    members = 3
    query_queue = 0
    query_time = 1
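
For completeness, the raft peer set reported above can be cross-checked via the operator command; a sketch (subcommand form of newer Consul CLIs):

    # Should list the same three voters as latest_configuration above
    consul operator raft list-peers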

Operating system and Environment details

Amazon Linux AMI 2017.09.1.20171120 x86_64 HVM

Log Fragments or Link to gist

March 13th 2018, 06:00:53.000   serf: attempting reconnect to prod-web-apollo-worker-i-0ffca95c1403ef180 XX.XX.118.199:8301 prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 06:00:46.000   serf: attempting reconnect to prod-pool-serpentapimobile-i-0bece69b45ca523b1 XX.XX.5.236:8301   prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 06:00:33.000   serf: attempting reconnect to prod-web-apollo-worker-i-0196173eb1e74be56 XX.XX.114.250:8301 prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 06:00:13.000   serf: attempting reconnect to prod-web-apollo-worker-i-0c0e09aae21e9ee66 XX.XX.116.0:8301   prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 06:00:06.000   serf: attempting reconnect to prod-web-apollo-worker-i-0b8e15c54a422b1c6 XX.XX.110.23:8301  prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:53.000   serf: attempting reconnect to prod-web-apollo-worker-i-0df99315665cdb1a3 XX.XX.118.58:8301  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:33.000   serf: attempting reconnect to prod-web-apollo-worker-i-0bd97a9b1377b9367 XX.XX.106.163:8301 prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:26.000   serf: attempting reconnect to ip-10-120-2-94 10.120.2.94:8301   prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:22.000   XMT: Solicit on eth0, interval 123670ms.    prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:17.000   XMT: Solicit on eth0, interval 118890ms.    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:15.000   raft: Failed to make RequestVote RPC to {Voter XX.XX.140.68:8300 XX.XX.140.68:8300}: read tcp XX.XX.136.179:59262->XX.XX.140.68:8300: i/o timeout   prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:13.000   serf: attempting reconnect to prod-web-apollo-worker-i-08d0f070ffd9d6858 XX.XX.106.24:8301  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:09.000   memberlist: Failed fallback ping: write tcp XX.XX.140.68:54600->XX.XX.5.171:8301: i/o timeout   prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:09.000   memberlist: Suspect prod-pool-amf-i-01844805652ab3e4b has failed, no acks received  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:08.000   consul.coordinate: Batch update failed: node is not the leader  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:08.000   memberlist: Refuting a suspect message (from: stag-pool-serpentapimobile-i-0610d577866ee05f9)   prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:07.000   consul: New leader elected: prod-consul-cluster-i-03a8cb1fbfc7a1f47 prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:07.000   raft: Election timeout reached, restarting election prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Node at XX.XX.136.179:8300 [Candidate] entering Candidate state in term 531   prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Election won. Tally: 2    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Node at XX.XX.136.179:8300 [Leader] entering Leader state prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Added peer XX.XX.138.57:8300, starting replication    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Added peer XX.XX.140.68:8300, starting replication    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   consul: cluster leadership acquired prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   consul: New leader elected: prod-consul-cluster-i-03a8cb1fbfc7a1f47 prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: pipelining replication to peer {Voter XX.XX.138.57:8300 XX.XX.138.57:8300}    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: pipelining replication to peer {Voter XX.XX.140.68:8300 XX.XX.140.68:8300}    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Node at XX.XX.138.57:8300 [Follower] entering Follower state (Leader: "") prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:07.000   consul: New leader elected: prod-consul-cluster-i-03a8cb1fbfc7a1f47 prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:06.000   raft: Duplicate RequestVote for same term: 530  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:06.000   raft: Duplicate RequestVote for same term: 530  prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:06.000   raft: Node at XX.XX.138.57:8300 [Candidate] entering Candidate state in term 530    prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:06.000   raft: Heartbeat timeout from "XX.XX.140.68:8300" reached, starting election prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:05.000   raft: Failed to contact quorum of nodes, stepping down  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: Node at XX.XX.136.179:8300 [Candidate] entering Candidate state in term 530   prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:05.000   raft: Failed to contact XX.XX.138.57:8300 in 500.242555ms   prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: Heartbeat timeout from "XX.XX.140.68:8300" reached, starting election prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:05.000   raft: Failed to contact XX.XX.136.179:8300 in 506.424996ms  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: Rejecting vote request from XX.XX.136.179:8300 since we have a leader: XX.XX.140.68:8300  prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:05.000   raft: Node at XX.XX.140.68:8300 [Follower] entering Follower state (Leader: "") prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   consul: cluster leadership lost prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: aborting pipeline replication to peer {Voter XX.XX.136.179:8300 XX.XX.136.179:8300}   prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: aborting pipeline replication to peer {Voter XX.XX.138.57:8300 XX.XX.138.57:8300} prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: Failed to contact XX.XX.138.57:8300 in 510.645137ms   prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:02.000   XMT: Solicit on eth0, interval 111630ms.    prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:58:53.000   serf: attempting reconnect to prod-web-apollo-worker-i-03038eff987ffa5c2 XX.XX.117.26:8301  prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:58:46.000   serf: attempting reconnect to prod-pool-serpentapimobilelogins-i-0a65c7a2c295d27dd XX.XX.3.11:8301  prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:58:33.000   serf: attempting reconnect to prod-web-apollo-worker-i-0e05ea44df6a59684 XX.XX.116.143:8301 prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:58:13.000   serf: attempting reconnect to prod-website-amf-i-02c0bb5ec9c9ec0e3 XX.XX.5.215:8301 prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:58:06.000   serf: attempting reconnect to prod-web-apollo-worker-i-02296fe5e42da86ef XX.XX.115.241:8301 prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:57:53.000   serf: attempting reconnect to prod-web-apollo-worker-i-01b25f006f88f4d17 XX.XX.108.61:8301  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:57:33.000   serf: attempting reconnect to prod-web-apollo-worker-i-06b6a3fb15e893e95 XX.XX.109.79:8301  prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:57:26.000   serf: attempting reconnect to prod-web-apollo-worker-i-0e05c52b33634fd37 XX.XX.111.63:8301
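
Worth noting in the fragment above: the leader steps down after failing to contact peers for roughly 500 ms, which is consistent with the raft leader lease timeout when raft_multiplier = 1 (the hashicorp/raft defaults apply unscaled). A crude way to tally election frequency across archived logs, as a sketch with a hypothetical log path:

    # Count election attempts recorded in a server log
    grep -c 'entering Candidate state' /var/log/consul/consul.log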
dpgaspar commented 6 years ago

Metrics captured during the event. The leader doesn't always change; we detect elections using the serf new-leader event metric:

[Screenshots: two metrics dashboards, captured 2018-03-15 at 18:10:29 and 18:16:38]
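
Independent of the metrics pipeline, leader flapping can also be observed by polling the status endpoint; a minimal sketch assuming a local agent on the default HTTP port:

    # Print the current raft leader every few seconds; changes indicate elections
    while true; do
        date; curl -s http://127.0.0.1:8500/v1/status/leader; echo
        sleep 5
    done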
pearkes commented 6 years ago

Just wanted to note this was opened after discussion with @banks on the Consul mailing list here.

pearkes commented 6 years ago

I'm not suggesting we've resolved this issue, but I'm going to close it for now. Without a tight reproduction we'll have trouble investigating this seriously, but it will be useful to keep it in the index here in case we see this or similar issues come up again.