Closed - Jam-Lin closed this issue 7 years ago
I'm using v0.6.4 and am constantly seeing raft: Failed to contact on one of my nodes. Surely this is a problem not caused by Consul itself, but it would be nice to be able to define our own timeout values.
+1 for this. We need to be able to configure these values. While backups run at certain times, the load pushes our latency above 500ms and the Consul cluster then breaks apart. We need at least 3000 ms in our situation.
+1 We are severely affected by this issue. Is there a plan to allow access to the above parameters via the consul.conf file? Thanks :)
+1 on this as well. Any updates on this?
We are planning to add the ability to tune this in the upcoming 0.7 release.
@slackpad this is great! Do you have any idea when we might get a first drop of this, ideally as a release, but if not then as a master commit we can at least test/build off? Thanks :)
Hi @nickithewatt this should land in the next few weeks in master and shortly after that as a release candidate build.
Hi @slackpad - did this change make it into the 0.7 release? I've seen the new performance config in 0.7, but even with the defaults we're still seeing timeouts when a Consul server is destroyed and replaced with a new node in AWS. Server nodes are t2.mediums, but even then we get timeouts which trigger leader elections.
@madAndroid this change made it into 0.7 - the multiplier directly scales the leader lease timeout. So you are replacing a non-leader node and getting an election?
Yes, but the heartbeats were timing out with the default... I've since adjusted the raft_multiplier to 7, and this appears to have improved things somewhat... still testing atm, will report back on progress. Thanks for taking the time to respond.
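For reference, that multiplier goes under the performance block of the agent's JSON config (the same stanza appears in the full config posted further down this thread); a minimal sketch with the value of 7 mentioned above:

{
  "performance": {
    "raft_multiplier": 7
  }
}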
@slackpad - We're still seeing the same behaviour, I'm afraid :(
We have 3 server nodes, one per AZ .. and we've now bumped these up to m3.mediums, and have set raft_multiplier to 8 .... when we randomly kill one of the servers, an ASG event kicks in and replaces that box ... and when it comes back up, it attempts to join the rest of the cluster (at this point we still have a leader, since the quorum of 2 ensures that) ... it joins at first ... then we see heartbeat timeouts, and a leader election is triggered on the node that attempts to join. The remaining two nodes reject the election request, and this continues.
10.206.52.43
-- Logs begin at Fri 2016-09-30 05:31:54 UTC, end at Fri 2016-09-30 06:10:29 UTC. --
Sep 30 06:07:23 core-dev-aeuw1-bastion-0ac9dc4fac99598b2 consul[17609]: 2016/09/30 06:07:23 [INFO] raft: Node at 10.206.52.43:8300 [Candidate] entering Candidate state in term 38
Sep 30 06:07:23 core-dev-aeuw1-bastion-0ac9dc4fac99598b2 consul[17609]: 2016/09/30 06:07:23 [INFO] raft: Node at 10.206.52.43:8300 [Follower] entering Follower state (Leader: "")
Sep 30 06:07:23 core-dev-aeuw1-bastion-0ac9dc4fac99598b2 consul[17609]: 2016/09/30 06:07:23 [INFO] consul: New leader elected: core-dev-aeuw1-bastion-02ed79fed6b05f028
Sep 30 06:08:36 core-dev-aeuw1-bastion-0ac9dc4fac99598b2 consul[17609]: 2016/09/30 06:08:36 [WARN] raft: Rejecting vote request from 10.206.54.87:8300 since we have a leader: 10.206.53.108:8300
Sep 30 06:08:48 core-dev-aeuw1-bastion-0ac9dc4fac99598b2 consul[17609]: 2016/09/30 06:08:48 [WARN] raft: Rejecting vote request from 10.206.54.87:8300 since we have a leader: 10.206.53.108:8300
Sep 30 06:09:03 core-dev-aeuw1-bastion-0ac9dc4fac99598b2 consul[17609]: 2016/09/30 06:09:03 [WARN] raft: Rejecting vote request from 10.206.54.87:8300 since we have a leader: 10.206.53.108:8300
Sep 30 06:09:18 core-dev-aeuw1-bastion-0ac9dc4fac99598b2 consul[17609]: 2016/09/30 06:09:18 [WARN] raft: Rejecting vote request from 10.206.54.87:8300 since we have a leader: 10.206.53.108:8300
Sep 30 06:09:30 core-dev-aeuw1-bastion-0ac9dc4fac99598b2 consul[17609]: 2016/09/30 06:09:30 [WARN] raft: Rejecting vote request from 10.206.54.87:8300 since we have a leader: 10.206.53.108:8300
Sep 30 06:09:31 core-dev-aeuw1-bastion-0ac9dc4fac99598b2 consul[17609]: 2016/09/30 06:09:31 [INFO] consul: New leader elected: core-dev-aeuw1-bastion-02ed79fed6b05f028
Sep 30 06:09:50 core-dev-aeuw1-bastion-0ac9dc4fac99598b2 consul[17609]: 2016/09/30 06:09:50 [INFO] serf: attempting reconnect to core-dev-aeuw1-bastion-00f0795cdae957ae0.core-dev 10.206.54.87:8302
10.206.53.108
-- Logs begin at Fri 2016-09-30 05:29:55 UTC, end at Fri 2016-09-30 06:10:29 UTC. --
Sep 30 06:09:31 core-dev-aeuw1-bastion-02ed79fed6b05f028 consul[16307]: 2016/09/30 06:09:31 [INFO] consul: cluster leadership acquired
Sep 30 06:09:31 core-dev-aeuw1-bastion-02ed79fed6b05f028 consul[16307]: 2016/09/30 06:09:31 [INFO] consul: New leader elected: core-dev-aeuw1-bastion-02ed79fed6b05f028
Sep 30 06:09:31 core-dev-aeuw1-bastion-02ed79fed6b05f028 consul[16307]: 2016/09/30 06:09:31 [INFO] raft: pipelining replication to peer {Voter 10.206.52.43:8300 10.206.52.43:8300}
Sep 30 06:09:31 core-dev-aeuw1-bastion-02ed79fed6b05f028 consul[16307]: 2016/09/30 06:09:31 [INFO] raft: Updating configuration with RemoveServer (10.206.54.87:8300, ) to [{Suffrage:Voter ID:10.206.53.108:8300 Address:10.206.53.108:8300} {Suffrage:Voter ID:10.206.52.43:8300 Address:10.206.52.43:8300}]
Sep 30 06:09:31 core-dev-aeuw1-bastion-02ed79fed6b05f028 consul[16307]: 2016/09/30 06:09:31 [INFO] raft: Removed peer 10.206.54.87:8300, stopping replication after 336
Sep 30 06:09:31 core-dev-aeuw1-bastion-02ed79fed6b05f028 consul[16307]: 2016/09/30 06:09:31 [INFO] raft: Updating configuration with AddStaging (10.206.54.87:8300, 10.206.54.87:8300) to [{Suffrage:Voter ID:10.206.53.108:8300 Address:10.206.53.108:8300} {Suffrage:Voter ID:10.206.52.43:8300 Address:10.206.52.43:8300} {Suffrage:Voter ID:10.206.54.87:8300 Address:10.206.54.87:8300}]
Sep 30 06:09:31 core-dev-aeuw1-bastion-02ed79fed6b05f028 consul[16307]: 2016/09/30 06:09:31 [INFO] raft: Added peer 10.206.54.87:8300, starting replication
Sep 30 06:09:31 core-dev-aeuw1-bastion-02ed79fed6b05f028 consul[16307]: 2016/09/30 06:09:31 [WARN] raft: AppendEntries to {Voter 10.206.54.87:8300 10.206.54.87:8300} rejected, sending older logs (next: 321)
Sep 30 06:09:31 core-dev-aeuw1-bastion-02ed79fed6b05f028 consul[16307]: 2016/09/30 06:09:31 [WARN] raft: AppendEntries to {Voter 10.206.54.87:8300 10.206.54.87:8300} rejected, sending older logs (next: 321)
Sep 30 06:09:31 core-dev-aeuw1-bastion-02ed79fed6b05f028 consul[16307]: 2016/09/30 06:09:31 [INFO] raft: pipelining replication to peer {Voter 10.206.54.87:8300 10.206.54.87:8300}
10.206.54.87
-- Logs begin at Fri 2016-09-30 05:50:21 UTC, end at Fri 2016-09-30 06:10:30 UTC. --
Sep 30 06:09:07 core-dev-aeuw1-bastion-0c56b6ef4856f6e25 consul[14417]: 2016/09/30 06:09:07 [ERR] http: Request GET /v1/health/state/any, error: No cluster leader from=10.206.54.87:53746
Sep 30 06:09:18 core-dev-aeuw1-bastion-0c56b6ef4856f6e25 consul[14417]: 2016/09/30 06:09:18 [WARN] raft: Election timeout reached, restarting election
Sep 30 06:09:18 core-dev-aeuw1-bastion-0c56b6ef4856f6e25 consul[14417]: 2016/09/30 06:09:18 [INFO] raft: Node at 10.206.54.87:8300 [Candidate] entering Candidate state in term 48
Sep 30 06:09:28 core-dev-aeuw1-bastion-0c56b6ef4856f6e25 consul[14417]: 2016/09/30 06:09:28 [ERR] agent: coordinate update error: No cluster leader
Sep 30 06:09:30 core-dev-aeuw1-bastion-0c56b6ef4856f6e25 consul[14417]: 2016/09/30 06:09:30 [WARN] raft: Election timeout reached, restarting election
Sep 30 06:09:30 core-dev-aeuw1-bastion-0c56b6ef4856f6e25 consul[14417]: 2016/09/30 06:09:30 [INFO] raft: Node at 10.206.54.87:8300 [Candidate] entering Candidate state in term 49
Sep 30 06:09:31 core-dev-aeuw1-bastion-0c56b6ef4856f6e25 consul[14417]: 2016/09/30 06:09:31 [INFO] raft: Node at 10.206.54.87:8300 [Follower] entering Follower state (Leader: "")
Sep 30 06:09:31 core-dev-aeuw1-bastion-0c56b6ef4856f6e25 consul[14417]: 2016/09/30 06:09:31 [WARN] raft: Failed to get previous log: 333 log not found (last: 320)
Sep 30 06:09:31 core-dev-aeuw1-bastion-0c56b6ef4856f6e25 consul[14417]: 2016/09/30 06:09:31 [WARN] raft: Failed to get previous log: 337 log not found (last: 320)
Sep 30 06:09:31 core-dev-aeuw1-bastion-0c56b6ef4856f6e25 consul[14417]: 2016/09/30 06:09:31 [INFO] consul: New leader elected: core-dev-aeuw1-bastion-02ed79fed6b05f028
Config:
{
  "bind_addr": "10.206.52.43",
  "client_addr": "0.0.0.0",
  "data_dir": "/opt/consul",
  "datacenter": "core-dev",
  "dns_config": {
    "allow_stale": true,
    "max_stale": "5s"
  },
  "leave_on_terminate": false,
  "log_level": "INFO",
  "node_name": "core-dev-aeuw1-bastion-0ac9dc4fac99598b2",
  "performance": {
    "raft_multiplier": 8
  },
  "retry_join": [
    "bastion0",
    "bastion1",
    "bastion2"
  ],
  "retry_join_wan": [
    "disco.infradev.core"
  ],
  "server": true,
  "ui_dir": "/opt/consul/ui"
}
When the leader election is flapping, this is what we see when running consul operator raft -list-peers:
10.206.52.43
Node ID Address State Voter
core-dev-aeuw1-bastion-02ed79fed6b05f028 10.206.53.108:8300 10.206.53.108:8300 leader true
core-dev-aeuw1-bastion-0ac9dc4fac99598b2 10.206.52.43:8300 10.206.52.43:8300 follower true
10.206.53.108
Node ID Address State Voter
core-dev-aeuw1-bastion-02ed79fed6b05f028 10.206.53.108:8300 10.206.53.108:8300 leader true
core-dev-aeuw1-bastion-0ac9dc4fac99598b2 10.206.52.43:8300 10.206.52.43:8300 follower true
10.206.54.87
Operator "raft" subcommand failed: Unexpected response code: 500 (No cluster leader)
@madAndroid the logs make it look like .87 joined the cluster - does it eventually become stable or does it repeat that with an election timeout shortly after?
@slackpad - I'll need to do some more testing, but most recently I've seen that the joining node enters this flapping state for some time: the instance appears to connect, and the logs indicate that it joins... then the heartbeat fails and it re-enters candidate mode. The other two nodes don't enter that same state; they reject leader election requests from the node in candidate state with messages saying "Rejecting vote request from 10.206.54.87:8300 since we have a leader". The cluster does eventually stabilise after an extended period of time - I'm not sure exactly how long this continues for, but during that time anything using Consul for discovery and/or DNS endpoints is in an unknown/unstable state.
I'll do some more testing and post more findings here
@madAndroid ok thanks. Also, 8 is a really slow setting for the Raft multiplier, so those timings are likely why things take so long to stabilize. We tuned the default of 5 against a t2.micro that had run out of CPU credits, so if you are running on an m3.medium that should be fine. Lower numbers give better performance.
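As a rough worked example of what the multiplier does (assuming the stock hashicorp/raft base timings of a 1000ms heartbeat/election timeout and a 500ms leader lease timeout, which the multiplier scales directly):

raft_multiplier = 5  ->  heartbeat/election timeout ~5000ms, leader lease timeout ~2500ms
raft_multiplier = 8  ->  heartbeat/election timeout ~8000ms, leader lease timeout ~4000ms

A higher multiplier tolerates more network latency, but a failed leader is detected and replaced more slowly.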
Okay, we're seeing the same behaviour
Server name: bastion0.infradev.core.example.com
10.206.48.109
-- Logs begin at Thu 2016-10-27 09:32:58 UTC, end at Thu 2016-10-27 09:57:40 UTC. --
Oct 27 09:57:09 core-infradev-aeuw1-bastion-019d82fb82b7a2281 consul[6268]: 2016/10/27 09:57:09 [WARN] raft: Election timeout reached, restarting election
Oct 27 09:57:09 core-infradev-aeuw1-bastion-019d82fb82b7a2281 consul[6268]: 2016/10/27 09:57:09 [INFO] raft: Node at 10.206.48.109:8300 [Candidate] entering Candidate state in term 280
Oct 27 09:57:16 core-infradev-aeuw1-bastion-019d82fb82b7a2281 consul[6268]: 2016/10/27 09:57:16 [WARN] raft: Election timeout reached, restarting election
Oct 27 09:57:16 core-infradev-aeuw1-bastion-019d82fb82b7a2281 consul[6268]: 2016/10/27 09:57:16 [INFO] raft: Node at 10.206.48.109:8300 [Candidate] entering Candidate state in term 281
Oct 27 09:57:22 core-infradev-aeuw1-bastion-019d82fb82b7a2281 consul[6268]: 2016/10/27 09:57:22 [INFO] raft: Node at 10.206.48.109:8300 [Follower] entering Follower state (Leader: "")
Oct 27 09:57:22 core-infradev-aeuw1-bastion-019d82fb82b7a2281 consul[6268]: 2016/10/27 09:57:22 [WARN] raft: Failed to get previous log: 13981 log not found (last: 13970)
Oct 27 09:57:22 core-infradev-aeuw1-bastion-019d82fb82b7a2281 consul[6268]: 2016/10/27 09:57:22 [INFO] consul: New leader elected: core-infradev-aeuw1-bastion-006d7aa0c6cb46d89
Oct 27 09:57:22 core-infradev-aeuw1-bastion-019d82fb82b7a2281 consul[6268]: 2016/10/27 09:57:22 [WARN] raft: Failed to get previous log: 13986 log not found (last: 13985)
Oct 27 09:57:29 core-infradev-aeuw1-bastion-019d82fb82b7a2281 consul[6268]: 2016/10/27 09:57:29 [WARN] raft: Rejecting vote request from 10.206.49.29:8300 since we have a leader: 10.206.50.87:8300
Oct 27 09:57:38 core-infradev-aeuw1-bastion-019d82fb82b7a2281 consul[6268]: 2016/10/27 09:57:38 [WARN] raft: Rejecting vote request from 10.206.49.29:8300 since we have a leader: 10.206.50.87:8300
Server name: bastion1.infradev.core.example.com
10.206.49.29
-- Logs begin at Thu 2016-10-27 09:02:50 UTC, end at Thu 2016-10-27 09:57:42 UTC. --
Oct 27 09:56:55 core-infradev-aeuw1-bastion-0f6497cbe297c558d consul[5790]: 2016/10/27 09:56:55 [WARN] raft: Rejecting vote request from 10.206.48.109:8300 since we have a leader: 10.206.50.87:8300
Oct 27 09:57:01 core-infradev-aeuw1-bastion-0f6497cbe297c558d consul[5790]: 2016/10/27 09:57:01 [WARN] raft: Rejecting vote request from 10.206.48.109:8300 since we have a leader: 10.206.50.87:8300
Oct 27 09:57:09 core-infradev-aeuw1-bastion-0f6497cbe297c558d consul[5790]: 2016/10/27 09:57:09 [WARN] raft: Rejecting vote request from 10.206.48.109:8300 since we have a leader: 10.206.50.87:8300
Oct 27 09:57:16 core-infradev-aeuw1-bastion-0f6497cbe297c558d consul[5790]: 2016/10/27 09:57:16 [WARN] raft: Rejecting vote request from 10.206.48.109:8300 since we have a leader: 10.206.50.87:8300
Oct 27 09:57:22 core-infradev-aeuw1-bastion-0f6497cbe297c558d consul[5790]: 2016/10/27 09:57:22 [INFO] consul: New leader elected: core-infradev-aeuw1-bastion-006d7aa0c6cb46d89
Oct 27 09:57:29 core-infradev-aeuw1-bastion-0f6497cbe297c558d consul[5790]: 2016/10/27 09:57:29 [WARN] raft: Heartbeat timeout from "10.206.50.87:8300" reached, starting election
Oct 27 09:57:29 core-infradev-aeuw1-bastion-0f6497cbe297c558d consul[5790]: 2016/10/27 09:57:29 [INFO] raft: Node at 10.206.49.29:8300 [Candidate] entering Candidate state in term 283
Oct 27 09:57:38 core-infradev-aeuw1-bastion-0f6497cbe297c558d consul[5790]: 2016/10/27 09:57:38 [WARN] raft: Election timeout reached, restarting election
Oct 27 09:57:38 core-infradev-aeuw1-bastion-0f6497cbe297c558d consul[5790]: 2016/10/27 09:57:38 [INFO] raft: Node at 10.206.49.29:8300 [Candidate] entering Candidate state in term 284
Oct 27 09:57:38 core-infradev-aeuw1-bastion-0f6497cbe297c558d consul[5790]: 2016/10/27 09:57:38 [ERR] agent: coordinate update error: No cluster leader
Server name: bastion2.infradev.core.example.com
10.206.50.87
-- Logs begin at Wed 2016-10-26 13:53:47 UTC, end at Thu 2016-10-27 09:57:43 UTC. --
Oct 27 09:57:22 core-infradev-aeuw1-bastion-006d7aa0c6cb46d89 consul[22013]: 2016/10/27 09:57:22 [INFO] raft: Updating configuration with RemoveServer (10.206.48.109:8300, ) to [{Suffrage:Voter ID:10.206.50.87:8300 Address:10.206.50.87:8300}]
Oct 27 09:57:22 core-infradev-aeuw1-bastion-006d7aa0c6cb46d89 consul[22013]: 2016/10/27 09:57:22 [INFO] raft: Removed peer 10.206.48.109:8300, stopping replication after 13985
Oct 27 09:57:22 core-infradev-aeuw1-bastion-006d7aa0c6cb46d89 consul[22013]: 2016/10/27 09:57:22 [INFO] raft: Updating configuration with AddStaging (10.206.48.109:8300, 10.206.48.109:8300) to [{Suffrage:Voter ID:10.206.50.87:8300 Address:10.206.50.87:8300} {Suffrage:Voter ID:10.206.48.109:8300 Address:10.206.48.109:8300}]
Oct 27 09:57:22 core-infradev-aeuw1-bastion-006d7aa0c6cb46d89 consul[22013]: 2016/10/27 09:57:22 [INFO] raft: aborting pipeline replication to peer {Voter 10.206.48.109:8300 10.206.48.109:8300}
Oct 27 09:57:22 core-infradev-aeuw1-bastion-006d7aa0c6cb46d89 consul[22013]: 2016/10/27 09:57:22 [INFO] raft: Added peer 10.206.48.109:8300, starting replication
Oct 27 09:57:22 core-infradev-aeuw1-bastion-006d7aa0c6cb46d89 consul[22013]: 2016/10/27 09:57:22 [WARN] raft: AppendEntries to {Voter 10.206.48.109:8300 10.206.48.109:8300} rejected, sending older logs (next: 13986)
Oct 27 09:57:22 core-infradev-aeuw1-bastion-006d7aa0c6cb46d89 consul[22013]: 2016/10/27 09:57:22 [INFO] raft: pipelining replication to peer {Voter 10.206.48.109:8300 10.206.48.109:8300}
Oct 27 09:57:29 core-infradev-aeuw1-bastion-006d7aa0c6cb46d89 consul[22013]: 2016/10/27 09:57:29 [WARN] raft: Rejecting vote request from 10.206.49.29:8300 since we have a leader: 10.206.50.87:8300
Oct 27 09:57:38 core-infradev-aeuw1-bastion-006d7aa0c6cb46d89 consul[22013]: 2016/10/27 09:57:38 [WARN] raft: Rejecting vote request from 10.206.49.29:8300 since we have a leader: 10.206.50.87:8300
Oct 27 09:57:41 core-infradev-aeuw1-bastion-006d7aa0c6cb46d89 consul[22013]: 2016/10/27 09:57:41 [INFO] serf: attempting reconnect to core-infradev-aeuw1-bastion-060ad1b6053c8d6b0.core-infradev 10.206.48.109:8302
We're now running with the default raft_multiplier of 5 on m3.medium instances. One thing to note is that these servers are bastion hosts that mount an ENI on boot, so that the IP address is always the same .... but we do set a hostname which contains the instance ID - could this be causing the issue?
Any additional pointers would be highly appreciated
An update from us regarding the leader re-elections we've been seeing: the issue went away when we moved the Consul servers onto a different set of hosts in our infrastructure.
We previously had Consul running on our bastion/NAT nodes, which attach an ENI during instance boot; to simplify the setup, once we had attached the secondary interface, eth1, we shut down eth0 .... it looks as though doing this was causing the issues we described above.
The hosts we've moved to don't have an additional interface attached. In order to implement this, we had to move to using the new retry_join_ec2 features in 0.7.1, since there is no longer a static address to connect to (we were previously setting the retry_join config to the ENI of the bastion hosts). Since moving the Consul servers to these new hosts, we can happily kill off the servers without experiencing the same problems we saw previously.
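For anyone making the same move, the EC2 tag-based join added in 0.7.1 is configured in the agent config along these lines (a sketch - the region, tag key, and tag value below are illustrative placeholders, not the actual values used here):

{
  "retry_join_ec2": {
    "region": "eu-west-1",
    "tag_key": "consul-role",
    "tag_value": "server"
  }
}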
Thanks for the update - closing this out!
Hello, I have a question: Consul is electing a new leader frequently. The config is as follows:
agent1 log:
I found out that the raft config is set to the defaults:
So if the LeaderLeaseTimeout is reached, the cluster will elect a new leader, but my network is bad and the network latency is always > 500ms. I want to change this value, so I have 3 questions:
Thank you very much.
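For reference, Consul 0.7+ doesn't expose LeaderLeaseTimeout directly; the raft_multiplier discussed earlier in this thread scales the 500ms base value (along with the heartbeat and election timeouts). A sketch of a config that should give roughly a 3000ms leader lease timeout, assuming those base timings:

{
  "performance": {
    "raft_multiplier": 6
  }
}

With a multiplier of 6, the leader lease timeout becomes roughly 6 x 500ms = 3000ms, at the cost of slower failure detection.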