hashicorp / consul


Consul repeated leader elections #3967

Closed dpgaspar closed 6 years ago

dpgaspar commented 6 years ago

Description of the Issue (and unexpected/desired result)

I'm running a 3-node Consul cluster (version 0.9.3) on AWS, with roughly 100 client nodes using consul-template.

The client counts fluctuate, but currently stand at 104 alive and 86 failed.
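
For reference, the failed members can be enumerated from any agent; a minimal sketch (assumes a local agent with the default setup):

    # List members whose serf status is "failed"
    consul members -status=failed
    # Count them, skipping the header line
    consul members -status=failed | tail -n +2 | wc -l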

I suspect that when a new deploy adds, say, 5 to 10 new Consul client nodes, and the rest of the clients can't gossip with them (e.g. due to a firewall misconfiguration), the Consul cluster goes into leader election roughly once a day.

If all nodes can gossip freely with each other, the cluster runs smoothly, although I still detect leader elections roughly every 15 days.

Of course, this is a suspicion, not a fact.
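
If a firewall misconfiguration is the cause, serf LAN gossip on port 8301 (both TCP and UDP) between the new clients and the rest of the fleet is the first thing to verify; a sketch with netcat, where the server address is a placeholder:

    # Serf LAN gossip defaults to port 8301 over both TCP and UDP
    nc -vz  10.0.0.1 8301    # TCP reachability
    nc -vzu 10.0.0.1 8301    # UDP (note: lack of a reply does not always mean blocked)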

The Consul servers run with raft_multiplier = 1 (best performance).
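
For context, a sketch of how that setting is typically applied as a server config fragment (the file path and JSON layout are assumptions about this deployment):

    # Write a config fragment setting raft_multiplier on each server
    cat <<'EOF' | sudo tee /etc/consul.d/performance.json
    {
      "performance": {
        "raft_multiplier": 1
      }
    }
    EOF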

Performance metrics:

Consul:
    RPC queries: ~16/s
    goroutines: 1.5K to 2.5K

System:
    network recv on the leader: ~65 KB/s
    CPU: consistently under 3-4%
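
These figures presumably come from Consul's telemetry. For a quick spot check without a metrics pipeline, an agent dumps its in-memory metrics when sent SIGUSR1 (the go-metrics default Consul uses); a sketch:

    # Dump the agent's in-memory telemetry (rates, counters) to its log output
    kill -USR1 $(pidof consul)
    # The goroutine count is also visible under runtime in `consul info`
    consul info | grep goroutines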

Reproduction steps

consul version for both Client and Server

Client: [client version here]
Server: [server version here]

consul info for both Client and Server

Client:

[Client `consul info` here]

Server:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 112c060
    version = 0.9.3
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = XX.XX.XX.179:8300
    server = true
raft:
    applied_index = 3969323
    commit_index = 3969323
    fsm_pending = 0
    last_contact = 0
    last_log_index = 3969323
    last_log_term = 546
    last_snapshot_index = 3961156
    last_snapshot_term = 546
    latest_configuration = [{Suffrage:Voter ID:XX.XX.XX.57:8300 Address:XX.XX.XX.57:8300} {Suffrage:Voter ID:XX.XX.XX.179:8300 Address:XX.XX.XX.179:8300} {Suffrage:Voter ID:XX.XX.XX.68:8300 Address:XX.XX.XX.68:8300}]
    latest_configuration_index = 2844431
    num_peers = 2
    protocol_version = 2
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 546
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 1968
    max_procs = 2
    os = linux
    version = go1.9
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 49
    failed = 83
    health_score = 0
    intent_queue = 0
    left = 1
    member_time = 6732
    members = 194
    query_queue = 0
    query_time = 277
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 204
    members = 3
    query_queue = 0
    query_time = 1
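
For completeness, the raft peer set reported above can be cross-checked via the operator command; a sketch (subcommand form of newer Consul CLIs):

    # Should list the same three voters as latest_configuration above
    consul operator raft list-peers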

Operating system and Environment details

Amazon Linux AMI 2017.09.1.20171120 x86_64 HVM

Log Fragments or Link to gist

March 13th 2018, 06:00:53.000   serf: attempting reconnect to prod-web-apollo-worker-i-0ffca95c1403ef180 XX.XX.118.199:8301 prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 06:00:46.000   serf: attempting reconnect to prod-pool-serpentapimobile-i-0bece69b45ca523b1 XX.XX.5.236:8301   prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 06:00:33.000   serf: attempting reconnect to prod-web-apollo-worker-i-0196173eb1e74be56 XX.XX.114.250:8301 prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 06:00:13.000   serf: attempting reconnect to prod-web-apollo-worker-i-0c0e09aae21e9ee66 XX.XX.116.0:8301   prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 06:00:06.000   serf: attempting reconnect to prod-web-apollo-worker-i-0b8e15c54a422b1c6 XX.XX.110.23:8301  prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:53.000   serf: attempting reconnect to prod-web-apollo-worker-i-0df99315665cdb1a3 XX.XX.118.58:8301  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:33.000   serf: attempting reconnect to prod-web-apollo-worker-i-0bd97a9b1377b9367 XX.XX.106.163:8301 prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:26.000   serf: attempting reconnect to ip-10-120-2-94 10.120.2.94:8301   prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:22.000   XMT: Solicit on eth0, interval 123670ms.    prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:17.000   XMT: Solicit on eth0, interval 118890ms.    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:15.000   raft: Failed to make RequestVote RPC to {Voter XX.XX.140.68:8300 XX.XX.140.68:8300}: read tcp XX.XX.136.179:59262->XX.XX.140.68:8300: i/o timeout   prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:13.000   serf: attempting reconnect to prod-web-apollo-worker-i-08d0f070ffd9d6858 XX.XX.106.24:8301  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:09.000   memberlist: Failed fallback ping: write tcp XX.XX.140.68:54600->XX.XX.5.171:8301: i/o timeout   prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:09.000   memberlist: Suspect prod-pool-amf-i-01844805652ab3e4b has failed, no acks received  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:08.000   consul.coordinate: Batch update failed: node is not the leader  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:08.000   memberlist: Refuting a suspect message (from: stag-pool-serpentapimobile-i-0610d577866ee05f9)   prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:07.000   consul: New leader elected: prod-consul-cluster-i-03a8cb1fbfc7a1f47 prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:07.000   raft: Election timeout reached, restarting election prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Node at XX.XX.136.179:8300 [Candidate] entering Candidate state in term 531   prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Election won. Tally: 2    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Node at XX.XX.136.179:8300 [Leader] entering Leader state prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Added peer XX.XX.138.57:8300, starting replication    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Added peer XX.XX.140.68:8300, starting replication    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   consul: cluster leadership acquired prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   consul: New leader elected: prod-consul-cluster-i-03a8cb1fbfc7a1f47 prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: pipelining replication to peer {Voter XX.XX.138.57:8300 XX.XX.138.57:8300}    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: pipelining replication to peer {Voter XX.XX.140.68:8300 XX.XX.140.68:8300}    prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:07.000   raft: Node at XX.XX.138.57:8300 [Follower] entering Follower state (Leader: "") prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:07.000   consul: New leader elected: prod-consul-cluster-i-03a8cb1fbfc7a1f47 prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:06.000   raft: Duplicate RequestVote for same term: 530  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:06.000   raft: Duplicate RequestVote for same term: 530  prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:06.000   raft: Node at XX.XX.138.57:8300 [Candidate] entering Candidate state in term 530    prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:06.000   raft: Heartbeat timeout from "XX.XX.140.68:8300" reached, starting election prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:05.000   raft: Failed to contact quorum of nodes, stepping down  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: Node at XX.XX.136.179:8300 [Candidate] entering Candidate state in term 530   prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:05.000   raft: Failed to contact XX.XX.138.57:8300 in 500.242555ms   prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: Heartbeat timeout from "XX.XX.140.68:8300" reached, starting election prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:59:05.000   raft: Failed to contact XX.XX.136.179:8300 in 506.424996ms  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: Rejecting vote request from XX.XX.136.179:8300 since we have a leader: XX.XX.140.68:8300  prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:59:05.000   raft: Node at XX.XX.140.68:8300 [Follower] entering Follower state (Leader: "") prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   consul: cluster leadership lost prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: aborting pipeline replication to peer {Voter XX.XX.136.179:8300 XX.XX.136.179:8300}   prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: aborting pipeline replication to peer {Voter XX.XX.138.57:8300 XX.XX.138.57:8300} prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:05.000   raft: Failed to contact XX.XX.138.57:8300 in 510.645137ms   prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:59:02.000   XMT: Solicit on eth0, interval 111630ms.    prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:58:53.000   serf: attempting reconnect to prod-web-apollo-worker-i-03038eff987ffa5c2 XX.XX.117.26:8301  prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:58:46.000   serf: attempting reconnect to prod-pool-serpentapimobilelogins-i-0a65c7a2c295d27dd XX.XX.3.11:8301  prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:58:33.000   serf: attempting reconnect to prod-web-apollo-worker-i-0e05ea44df6a59684 XX.XX.116.143:8301 prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:58:13.000   serf: attempting reconnect to prod-website-amf-i-02c0bb5ec9c9ec0e3 XX.XX.5.215:8301 prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:58:06.000   serf: attempting reconnect to prod-web-apollo-worker-i-02296fe5e42da86ef XX.XX.115.241:8301 prod-consul-cluster-i-0b9de6b437bc1ceff
March 13th 2018, 05:57:53.000   serf: attempting reconnect to prod-web-apollo-worker-i-01b25f006f88f4d17 XX.XX.108.61:8301  prod-consul-cluster-i-00b054a92fffe2703
March 13th 2018, 05:57:33.000   serf: attempting reconnect to prod-web-apollo-worker-i-06b6a3fb15e893e95 XX.XX.109.79:8301  prod-consul-cluster-i-03a8cb1fbfc7a1f47
March 13th 2018, 05:57:26.000   serf: attempting reconnect to prod-web-apollo-worker-i-0e05c52b33634fd37 XX.XX.111.63:8301
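
Worth noting in the fragment above: the leader steps down after failing to contact peers for roughly 500 ms, which is consistent with the raft leader lease timeout when raft_multiplier = 1 (the hashicorp/raft defaults apply unscaled). A crude way to tally election frequency across archived logs, as a sketch with a hypothetical log path:

    # Count election attempts recorded in a server log
    grep -c 'entering Candidate state' /var/log/consul/consul.log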
dpgaspar commented 6 years ago

Metrics captured during the event. The leader doesn't always change; we detect elections using the serf new-leader event metric:

[Screenshots: two metrics dashboards, captured 2018-03-15 at 18:10:29 and 18:16:38]
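
Independent of the metrics pipeline, leader flapping can also be observed by polling the status endpoint; a minimal sketch assuming a local agent on the default HTTP port:

    # Print the current raft leader every few seconds; changes indicate elections
    while true; do
        date; curl -s http://127.0.0.1:8500/v1/status/leader; echo
        sleep 5
    done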
pearkes commented 6 years ago

Just wanted to note this was opened after discussion with @banks on the Consul mailing list here.

pearkes commented 6 years ago

I'm not suggesting we've resolved this issue, but I'm going to close it for now. Without a tight reproduction we'll have trouble investigating this seriously, but it will be useful to keep it in the index here in case we see this or similar issues come up again.