Hey @discordianfish, the consul members and /v1/status/peers output come from different sources; the former comes from the gossip layer and the latter from the Raft layer. That explains why they may be showing different peer sets.
Are the old members fully dead (unreachable)? Were they forcefully terminated, or were they shut down gracefully? A graceful shutdown allows the node to announce its intention of leaving whereas a force shutdown leaves the member in the peer list for another 72h in case it comes back.
Can you share your full configuration and some of the logs from each server at start time? The bootstrap options are important here and this will help paint a more complete picture of what's going on.
To add to what @ryanuber said, it would also be useful to see the /peers output from each of the servers other than 10-1-24-247 to see whether they are in a healthy state.
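For example, something along these lines run on each server would capture both views (this assumes the agent answers on localhost:8500, as in a default setup):
# gossip (Serf) view of the cluster
consul members
# Raft peer set and current leader
curl -s http://localhost:8500/v1/status/peers
curl -s http://localhost:8500/v1/status/leader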
I suspected something like that, but how is it possible that the leader isn't included in the peers list?
I have a bunch of servers which didn't leave the cluster properly, so it's quite possible that this contributes to this problem (working on this issue right now). Still, it looks to me like there is also some issue with consul itself leading to the inconsistency between leader and peers.
For the configuration, I basically ask the AWS API for the instances of the stack (using CloudFormation here). In addition to that, I use another tag to figure out whether an instance is supposed to be a server or not. All nodes use this config:
{
"ports": {
"dns": 53
},
"disable_remote_exec": true,
"leave_on_terminate": true
}
...and I just realized that I still explicitly set -protocol 2 on all nodes.
On server nodes I start consul with these parameters:
consul agent -data-dir /var/lib/consul -config-dir=/etc/consul -protocol 2 -ui-dir /usr/share/consul-ui -client 0.0.0.0 -server -bootstrap-expect 3 -retry-join 10.1.41.72 -retry-join 10.1.29.94 -retry-join 10.1.9.157
... where the IPs are those of the other instances tagged as server.
On client nodes, I run:
/usr/bin/consul agent -data-dir /var/lib/consul -config-dir=/etc/consul -protocol 2 -ui-dir /usr/share/consul-ui -client 0.0.0.0 -retry-join 10.1.41.72 -retry-join 10.1.29.94 -retry-join 10.1.9.15
...where again the IPs are those of the server instances at the time consul was started.
@slackpad: Here is the peers output from each server:
10.1.29.94: ["10.1.41.72:8300","10.1.29.94:8300","10.1.9.157:8300"]
10.1.41.72: ["10.1.41.72:8300","10.1.29.94:8300","10.1.9.157:8300"]
10.1.9.157: ["10.1.9.157:8300","10.1.41.212:8300","10.1.31.178:8300","10.1.12.90:8300"]
Here are a few lines from the log of 10.1.9.157:
2016-01-04_18:13:33.12344 2016/01/04 18:13:33 [INFO] snapshot: reaping snapshot /var/lib/consul/raft/snapshots/570-358870-1451930953138
2016-01-04_18:13:33.12344 2016/01/04 18:13:33 [INFO] raft: Compacting logs from 109473 to 109501
2016-01-04_18:13:33.12581 2016/01/04 18:13:33 [INFO] raft: Snapshot to 358870 complete
2016-01-04_18:16:16.81566 2016/01/04 18:16:16 [INFO] consul.fsm: snapshot created in 24.163µs
2016-01-04_18:16:16.81567 2016/01/04 18:16:16 [INFO] raft: Starting snapshot up to 358870
2016-01-04_18:16:16.81567 2016/01/04 18:16:16 [INFO] snapshot: Creating new snapshot at /var/lib/consul/raft/snapshots/570-358870-1451931376812.tmp
2016-01-04_18:16:16.81568 2016/01/04 18:16:16 [INFO] snapshot: reaping snapshot /var/lib/consul/raft/snapshots/570-358870-1451931077231
2016-01-04_18:16:16.81568 2016/01/04 18:16:16 [INFO] raft: Compacting logs from 109502 to 109550
2016-01-04_18:16:17.01989 2016/01/04 18:16:17 [INFO] raft: Snapshot to 358870 complete
2016-01-04_18:17:18.15188 2016/01/04 18:17:18 [INFO] serf: attempting reconnect to ip-10-1-14-234 10.1.14.234:8301
And here are some lines from the log of 10.1.29.94:
2016-01-04_16:20:12.11471 2016/01/04 16:20:12 [INFO] snapshot: reaping snapshot /var/lib/consul/raft/snapshots/870-68044-1451923908831
2016-01-04_16:20:12.11487 2016/01/04 16:20:12 [INFO] raft: Compacting logs from 69782 to 85239
2016-01-04_16:20:12.16272 2016/01/04 16:20:12 [INFO] raft: Snapshot to 95479 complete
2016-01-04_16:22:11.15198 2016/01/04 16:22:11 [INFO] memberlist: Marking ip-10-1-30-188 as failed, suspect timeout reached
2016-01-04_16:22:11.15199 2016/01/04 16:22:11 [INFO] serf: EventMemberFailed: ip-10-1-30-188 10.1.30.188
2016-01-04_16:22:12.52031 2016/01/04 16:22:12 [INFO] memberlist: Marking ip-10-1-14-234 as failed, suspect timeout reached
2016-01-04_16:22:12.52032 2016/01/04 16:22:12 [INFO] serf: EventMemberFailed: ip-10-1-14-234 10.1.14.234
2016-01-04_16:22:13.18879 2016/01/04 16:22:13 [INFO] serf: EventMemberFailed: ip-10-1-46-44 10.1.46.44
2016-01-04_16:22:55.17294 2016/01/04 16:22:55 [INFO] serf: attempting reconnect to ip-10-1-41-212 10.1.41.212:8301
2016-01-04_16:24:05.17385 2016/01/04 16:24:05 [INFO] serf: attempting reconnect to ip-10-1-41-212 10.1.41.212:8301
2016-01-04_16:24:25.70454 2016/01/04 16:24:25 [INFO] serf: EventMemberJoin: ip-10-1-25-3 10.1.25.3
2016-01-04_16:24:25.88596 2016/01/04 16:24:25 [INFO] serf: EventMemberJoin: ip-10-1-44-13 10.1.44.13
2016-01-04_16:24:27.01968 2016/01/04 16:24:27 [INFO] serf: EventMemberJoin: ip-10-1-10-17 10.1.10.17
2016-01-04_16:24:45.17407 2016/01/04 16:24:45 [INFO] serf: attempting reconnect to ip-10-1-14-234 10.1.14.234:8301
2016-01-04_16:24:47.38073 2016/01/04 16:24:47 [INFO] consul.fsm: snapshot created in 24.815µs
2016-01-04_16:24:47.38082 2016/01/04 16:24:47 [INFO] raft: Starting snapshot up to 108285
2016-01-04_16:24:47.38087 2016/01/04 16:24:47 [INFO] snapshot: Creating new snapshot at /var/lib/consul/raft/snapshots/870-108285-1451924687380.tmp
2016-01-04_16:24:47.39008 2016/01/04 16:24:47 [INFO] snapshot: reaping snapshot /var/lib/consul/raft/snapshots/870-80020-1451924128158
2016-01-04_16:24:47.39028 2016/01/04 16:24:47 [INFO] raft: Compacting logs from 85240 to 98046
2016-01-04_16:24:47.43469 2016/01/04 16:24:47 [INFO] raft: Snapshot to 108285 complete
2016-01-04_16:25:19.54398 2016/01/04 16:25:19 [INFO] memberlist: Marking ip-10-1-30-187 as failed, suspect timeout reached
2016-01-04_16:25:19.54411 2016/01/04 16:25:19 [INFO] serf: EventMemberFailed: ip-10-1-30-187 10.1.30.187
2016-01-04_16:25:29.55614 2016/01/04 16:25:29 [INFO] memberlist: Marking ip-10-1-46-43 as failed, suspect timeout reached
2016-01-04_16:25:29.55626 2016/01/04 16:25:29 [INFO] serf: EventMemberFailed: ip-10-1-46-43 10.1.46.43
2016-01-04_16:25:41.13007 2016/01/04 16:25:41 [INFO] memberlist: Suspect ip-10-1-14-232 has failed, no acks received
2016-01-04_16:25:46.14224 2016/01/04 16:25:46 [INFO] serf: EventMemberFailed: ip-10-1-14-232 10.1.14.232
2016-01-04_16:27:49.91423 2016/01/04 16:27:49 [INFO] serf: EventMemberJoin: ip-10-1-44-56 10.1.44.56
^- Possibly when I did a rolling replacement of the instances.
Currently I only see the (expected) serf issues due to the non-gracefully removed instances:
2016-01-04_18:20:15.21293 2016/01/04 18:20:15 [INFO] serf: attempting reconnect to ip-10-1-46-44 10.1.46.44:8301
2016-01-04_18:21:55.21378 2016/01/04 18:21:55 [INFO] serf: attempting reconnect to ip-10-1-30-187 10.1.30.187:8301
2016-01-04_18:22:28.21203 2016/01/04 18:22:28 [INFO] serf: attempting reconnect to ip-10-1-10-181 10.1.10.181:8301
I can also provide the complete log (well those I still have, so only from the currently running instances) if necessary. Just need to spend some time scrubbing them.
@discordianfish is it possible that you did a rolling restart of the servers without giving them time to rejoin and become peers again? That might have pushed your cluster into an outage state. If your config file has any use of -bootstrap, you could end up in a split-brain situation like this as well.
In any case, it looks like .157 is in a bad state where it has peers that are gone, and it hasn't added the other two good server nodes. I'd probably make that one leave and add a new server, or restart .157 with a clean data-dir, after making sure you are not using -bootstrap in your config.
@slackpad It should have waited for the server nodes to successfully join the cluster. Before continuing with the next instance, I run:
IP=$(ip addr show dev eth0|awk '/inet /{print $2}'|cut -d/ -f1)
while ! curl -s http://localhost:8500/v1/status/peers | grep -q $IP:; do echo Waiting for consul; sleep 1; done
That should make sure the node successfully joined. And as far as I understand, -bootstrap should be okay as long as the nodes I point it to are already bootstrapped.
I see how it's simpler to reason about the state if -bootstrap is removed, yet that isn't that trivial to automate.
I'd need to dig deeper in Raft to confirm but I think there are cases where the server might be the only peer in there. It may robustify your script above to make sure the server's IP is in there, and that there are N entries total in the peers list before moving on.
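A sketch of that hardened check, assuming the expected server count matches the -bootstrap-expect 3 from the config above:
IP=$(ip addr show dev eth0 | awk '/inet /{print $2}' | cut -d/ -f1)
EXPECTED=3
# wait until this server's IP is in the Raft peer list and the list has the expected size
until curl -s http://localhost:8500/v1/status/peers | grep -q "$IP:" \
  && [ "$(curl -s http://localhost:8500/v1/status/peers | tr ',' '\n' | grep -c ':8300')" -ge "$EXPECTED" ]; do
  echo Waiting for consul; sleep 1
done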
That should make sure the node successfully joined.. And as far as I understand, -bootstrap should be okay as long as the nodes I point it to are already bootstrapped..
This could still be dangerous for causing split-brains. It is better to use bootstrap-expect and set a retry-join list or similar.
This is very similar to https://github.com/hashicorp/consul/issues/1560, so linking these.
To clarify: I use bootstrap-expect 3 and retry-join.
This is slightly off-topic, but since I came here looking for a robust way to wait for a consul cluster to self-assemble (in a context where I know it eventually will) perhaps others will be interested.
Right now I'm attempting to wait with a poll loop that explicitly asks for a lock: consul lock -n 64 wait-for-consul echo Consul is up. It seems that if this command succeeds (in the exit-code sense) then I am good to go. @slackpad How would you rate this tactic vs. polling the peers list?
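A minimal sketch of that poll loop, assuming wait-for-consul is just an arbitrary KV prefix used for the lock:
# retry until a session and lock can be acquired, i.e. until a write quorum exists
until consul lock -n 64 wait-for-consul echo Consul is up; do
  echo Waiting for consul; sleep 1
done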
Following that train of thought, is there an argument against consul supporting something like consul watch -type consul -event healthy echo Consul is up?
@phs I think your use of consul lock looks like it would work well - that will only pass if a write is able to get through (actually two, since it has to make a session as well), so there needs to be a quorum of servers up and running to accomplish that. This should be more reliable than polling the peers list.
It's a good suggestion to make a first-class "is consul up" command - we can keep track of that here.
@slackpad We have a similar situation, where the /v1/status/peers endpoint shows (besides all current masters) a peer that was deleted some time ago. consul members doesn't show that node. The output is the same for all three master nodes.
The outage recovery document (https://www.consul.io/docs/guides/outage.html) says I can fix that by stopping all masters, editing the peers.json file and starting the servers again. I was wondering if there is also a way to fix this while keeping the cluster alive?
I'd like to prevent downtime if I can. And except for some raft messages saying the peer cannot be found for voting (Failed to make RequestVote RPC), the cluster seems to be operating fine otherwise...
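For reference, the peers.json from that guide is just a JSON array of server Raft addresses, in the same format as the /v1/status/peers output above. A rough sketch of the offline fix, reusing the example addresses and data-dir from earlier in this thread:
# with consul stopped on every server, write the desired peer set, then restart
cat > /var/lib/consul/raft/peers.json <<'EOF'
["10.1.41.72:8300","10.1.29.94:8300","10.1.9.157:8300"]
EOF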
@amochtar unfortunately there's currently not a way to force a peer out if they are no longer in the cluster's member list, so stopping and updating the peers list is the way to fix it. The danger with leaving it around is that it will increase your quorum size. For example, if you have 3 good + 1 zombie server then your quorum size would be 3, so losing one of your good servers could cause an outage. If you remove that zombie server then your quorum size drops to 2 and you will be able to handle the outage of a server as expected.
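The underlying arithmetic is quorum = floor(N/2) + 1, where N counts every entry in the peer set, zombies included. A quick way to check it against the local agent:
N=$(curl -s http://localhost:8500/v1/status/peers | tr ',' '\n' | grep -c ':8300')
echo "peers: $N, quorum: $(( N / 2 + 1 ))"
# N=4 (3 good + 1 zombie): quorum 3, so one more failure causes an outage
# N=3 (zombie removed):    quorum 2, so one failure is still tolerated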
That's too bad... And what would happen if I start a new peer on the same IP and have it join the cluster, wait for it to sync and then gracefully leave the cluster?
@amochtar that should work if you can give it the same IP. You can use consul leave to make sure it's gone from the cluster before you retire it.
@slackpad that worked :) created a new node with the old IP address, started a new consul agent, joined the existing cluster, then left again and it nicely cleaned the peers list 👍
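Roughly the sequence described above, on a fresh machine brought up with the zombie peer's old IP (addresses reused from the earlier example; adjust to your cluster):
consul agent -server -data-dir /var/lib/consul -config-dir=/etc/consul \
  -retry-join 10.1.41.72 -retry-join 10.1.29.94 -retry-join 10.1.9.157 &
# wait until the node shows up in consul members and /v1/status/peers on the existing servers,
# then retire it gracefully:
consul leave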
What is the purpose of this API? I thought it would be indicative of the current peers, not the configured peers.
Hi @lswith it does show the current peers - the Raft library calls that the "configuration" - it doesn't have to do with any configuration files.
It is interesting, though, because the info command shows a different number of peers. I thought this API would expose that instead?
Consul 0.8 added https://www.consul.io/docs/guides/autopilot.html which will automatically clean up dead servers to keep things in sync.
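If memory serves, the dead-server cleanup in 0.8+ can be inspected and toggled with the operator command, roughly:
consul operator autopilot get-config
consul operator autopilot set-config -cleanup-dead-servers=true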
Hi,
I'm running a 3 node consul 0.6.0 cluster on AWS. After doing some rolling upgrades which replaced the consul server instances (not sure if relevant though), my cluster got into a weird state:
The output of consul members seems to reflect reality: all those nodes are running consul server, are reachable, and the leader is among them. The /status/peers list on the other hand includes only one reachable system. It's also not the list of -(retry-)join parameters, as you can see (those IPs are old, which AFAIK is okay (let me know if not) since they are only used for initial bootstrapping anyway. I just keep things )