Closed dpgaspar closed 6 years ago
These are the metrics we have from during the event. The leader does not always change; we detect leader changes using the Serf new-leader event metric:
Just wanted to note this was opened after discussion with @banks on the Consul mailing list here.
I don't suggest we've resolved this issue but I'm going to close it for now. Without a tight reproduction we'll have trouble looking into this seriously, but it will be useful to keep in the index here if we see this or similar issues come up again.
Description of the Issue (and unexpected/desired result)
I'm running a 3-node Consul cluster (version 0.9.3) on AWS, with roughly 100 client nodes using consul-template.
The client states change over time; currently there are 104 alive clients and 86 failed.
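For anyone who wants to track the alive/failed counts over time, here is a minimal Python sketch. It assumes a local agent reachable at `http://127.0.0.1:8500` and that `/v1/agent/members` reports the Serf status as an integer in the `Status` field (1 = alive, 4 = failed). The helper names `summarize_members` and `fetch_members` are hypothetical and not part of the original report.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Assumed mapping of Serf member status codes, as exposed in the
# "Status" field of /v1/agent/members.
SERF_STATUS = {0: "none", 1: "alive", 2: "leaving", 3: "left", 4: "failed"}


def summarize_members(members):
    """Count members by Serf status name.

    `members` is a list of dicts shaped like the /v1/agent/members response.
    Unrecognized or missing status codes are counted as "unknown".
    """
    return Counter(SERF_STATUS.get(m.get("Status"), "unknown") for m in members)


def fetch_members(base_url="http://127.0.0.1:8500"):
    """Fetch the member list from a local agent (address is an assumption)."""
    with urlopen(f"{base_url}/v1/agent/members", timeout=5) as resp:
        return json.load(resp)
```

Calling `summarize_members(fetch_members())` on any agent should give counts comparable to the numbers above. Note that the list includes the servers as well as the clients.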
I suspect that if a new deploy adds, say, 5 to 10 new Consul client nodes and the rest of the clients can't gossip with them (e.g. because of a firewall misconfiguration), the Consul cluster may go into leader election roughly once a day.
If all nodes gossip freely with each other, the cluster runs smoothly, although I still see leader elections roughly once every 15 days.
Of course, this is a suspicion, not a fact.
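One way to put numbers on how often the leader changes, independent of the metrics pipeline, is to poll the leader address and record every transition. The sketch below is only an illustration under assumptions: it assumes a local agent at `http://127.0.0.1:8500`, relies on `/v1/status/leader` returning the leader's address as a JSON string (empty when there is no leader), and uses hypothetical helper names (`fetch_leader`, `leader_changes`, `watch`).

```python
import json
import time
from urllib.request import urlopen


def fetch_leader(base_url="http://127.0.0.1:8500"):
    """Return the current Raft leader address as reported by the agent."""
    with urlopen(f"{base_url}/v1/status/leader", timeout=5) as resp:
        return json.load(resp)


def leader_changes(samples):
    """Return (timestamp, old_leader, new_leader) for each observed change.

    `samples` is a time-ordered iterable of (timestamp, leader) pairs.
    """
    changes = []
    previous = None
    for ts, leader in samples:
        if previous is not None and leader != previous:
            changes.append((ts, previous, leader))
        previous = leader
    return changes


def watch(interval_s=10, base_url="http://127.0.0.1:8500"):
    """Poll forever and print each leader transition (interval is arbitrary)."""
    previous = None
    while True:
        leader = fetch_leader(base_url)
        if previous is not None and leader != previous:
            print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} leader {previous!r} -> {leader!r}")
        previous = leader
        time.sleep(interval_s)
```

Polling can miss an election that starts and finishes between two samples with the same leader re-elected, so this complements rather than replaces the new-leader event metric.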
consul cluster raft_multiplier = 1 (best performance)
Performance metrics:
Consul:
RPC queries: 16/s
Goroutines: 1.5K to 2.5K
System:
Network recv on leader: ~65 KB/s
CPU: always under 3-4%
Reproduction steps
consul version for both Client and Server
Client:
[client version here]
Server:
[server version here]
consul info for both Client and Server
Client:
Server:
Operating system and Environment details
Amazon Linux AMI 2017.09.1.20171120 x86_64 HVM
Log Fragments or Link to gist