hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

consul members and v1/status/peers inconsistent #1562

Closed: discordianfish closed this issue 7 years ago

discordianfish commented 8 years ago

Hi,

I'm running a 3-node Consul 0.6.0 cluster on AWS. After doing some rolling upgrades that replaced the consul server instances (not sure if that's relevant), my cluster got into a weird state:

admin@ip-10-1-24-247:~$ consul members|grep alive.*server
ip-10-1-29-94   10.1.29.94:8301   alive   server  0.6.0  2         dc1
ip-10-1-41-72   10.1.41.72:8301   alive   server  0.6.0  2         dc1
ip-10-1-9-157   10.1.9.157:8301   alive   server  0.6.0  2         dc1

admin@ip-10-1-24-247:~$ curl localhost:8500/v1/status/peers
["10.1.9.157:8300","10.1.41.212:8300","10.1.31.178:8300","10.1.12.90:8300"]

admin@ip-10-1-24-247:~$ curl localhost:8500/v1/status/leader
"10.1.41.72:8300"

admin@ip-10-1-24-247:~$ ps -ef|grep consul
root       408   338  8 11:45 ?        00:29:48 /usr/bin/consul agent -data-dir /var/lib/consul -config-dir=/etc/consul -protocol 2 -ui-dir /usr/share/consul-ui -client 0.0.0.0 -retry-join 10.1.14.123 -retry-join 10.1.24.232 -retry-join 10.1.40.228

The output of consul members seems to reflect reality: all of those nodes are running consul server, they are reachable, and the leader is among them. The /status/peers list, on the other hand, includes only one reachable system. As you can see, it's also not the list of -(retry-)join parameters (those IPs are old, which AFAIK is okay since they are only used for the initial join anyway; let me know if not).

ryanuber commented 8 years ago

Hey @discordianfish, the consul members and /v1/status/peers output come from different sources; the former being from the gossip layer and the latter being from the Raft layer. That explains why they may be showing different peer sets.
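For anyone comparing the two, both views can be pulled from a single agent; a minimal sketch, assuming the default HTTP port on localhost:

# Gossip (Serf) view of the servers
consul members | grep server

# Raft view of the voting peers
curl -s http://localhost:8500/v1/status/peers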

Are the old members fully dead (unreachable)? Were they forcefully terminated, or were they shut down gracefully? A graceful shutdown allows the node to announce its intention to leave, whereas a forced shutdown leaves the member in the peer list for another 72h in case it comes back.
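As a sketch of that difference with the standard CLI (the node name below is just one of the failed nodes from the logs in this thread, used as a placeholder):

# Graceful: the agent announces its intent to leave and is removed from the member list
consul leave

# Forced: mark an already-dead node as left instead of waiting out the reconnect window
consul force-leave ip-10-1-14-234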

Can you share your full configuration and some of the logs from each server at start time? The bootstrap options are important here and this will help paint a more complete picture of what's going on.

slackpad commented 8 years ago

To add to what @ryanuber said, it would also be useful to see the /peers output from each of the servers other than 10-1-24-247 to see if they look like they are in a healthy state.

discordianfish commented 8 years ago

I suspected something like that, but how is it possible that the leader isn't included in the peers list?

I have a bunch of servers which didn't leave the cluster properly, so it's quite possible that this contributes to the problem (I'm working on that right now). Still, it looks to me like there is also some issue with consul itself leading to the inconsistency between leader and peers.

For the configuration, I basically ask the AWS API for the instances of the stack (using CloudFormation here). In addition, I use another tag to figure out whether an instance is supposed to be a server or not. All nodes use this config:

{
  "ports": {
    "dns": 53
  },
  "disable_remote_exec": true,
  "leave_on_terminate": true
}

...and I just realized that I still explicitly set -protocol 2 on all nodes.

On server nodes I start consul with these parameters:

consul agent -data-dir /var/lib/consul -config-dir=/etc/consul -protocol 2 -ui-dir /usr/share/consul-ui -client 0.0.0.0 -server -bootstrap-expect 3 -retry-join 10.1.41.72 -retry-join 10.1.29.94 -retry-join 10.1.9.157

... where the IPs are those of the other instances tagged as server.
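For illustration, that lookup could look roughly like this (a rough sketch; the stack name and the consul-role tag are placeholders, not the real ones):

aws ec2 describe-instances \
  --filters "Name=tag:aws:cloudformation:stack-name,Values=my-stack" \
            "Name=tag:consul-role,Values=server" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].PrivateIpAddress' \
  --output text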

On client nodes, I run:

/usr/bin/consul agent -data-dir /var/lib/consul -config-dir=/etc/consul -protocol 2 -ui-dir /usr/share/consul-ui -client 0.0.0.0 -retry-join 10.1.41.72 -retry-join 10.1.29.94 -retry-join 10.1.9.157

...where again the IPs are those of the server instances at the time consul was started.

@slackpad: Here is the peers output from each server:

10.1.29.94: ["10.1.41.72:8300","10.1.29.94:8300","10.1.9.157:8300"]
10.1.41.72: ["10.1.41.72:8300","10.1.29.94:8300","10.1.9.157:8300"]
10.1.9.157: ["10.1.9.157:8300","10.1.41.212:8300","10.1.31.178:8300","10.1.12.90:8300"]

Here are a few lines from the log of 10.1.9.157:

2016-01-04_18:13:33.12344     2016/01/04 18:13:33 [INFO] snapshot: reaping snapshot /var/lib/consul/raft/snapshots/570-358870-1451930953138
2016-01-04_18:13:33.12344     2016/01/04 18:13:33 [INFO] raft: Compacting logs from 109473 to 109501
2016-01-04_18:13:33.12581     2016/01/04 18:13:33 [INFO] raft: Snapshot to 358870 complete
2016-01-04_18:16:16.81566     2016/01/04 18:16:16 [INFO] consul.fsm: snapshot created in 24.163µs
2016-01-04_18:16:16.81567     2016/01/04 18:16:16 [INFO] raft: Starting snapshot up to 358870
2016-01-04_18:16:16.81567     2016/01/04 18:16:16 [INFO] snapshot: Creating new snapshot at /var/lib/consul/raft/snapshots/570-358870-1451931376812.tmp
2016-01-04_18:16:16.81568     2016/01/04 18:16:16 [INFO] snapshot: reaping snapshot /var/lib/consul/raft/snapshots/570-358870-1451931077231
2016-01-04_18:16:16.81568     2016/01/04 18:16:16 [INFO] raft: Compacting logs from 109502 to 109550
2016-01-04_18:16:17.01989     2016/01/04 18:16:17 [INFO] raft: Snapshot to 358870 complete
2016-01-04_18:17:18.15188     2016/01/04 18:17:18 [INFO] serf: attempting reconnect to ip-10-1-14-234 10.1.14.234:8301

And here some of 10.1.29.94:

2016-01-04_16:20:12.11471     2016/01/04 16:20:12 [INFO] snapshot: reaping snapshot /var/lib/consul/raft/snapshots/870-68044-1451923908831
2016-01-04_16:20:12.11487     2016/01/04 16:20:12 [INFO] raft: Compacting logs from 69782 to 85239
2016-01-04_16:20:12.16272     2016/01/04 16:20:12 [INFO] raft: Snapshot to 95479 complete
2016-01-04_16:22:11.15198     2016/01/04 16:22:11 [INFO] memberlist: Marking ip-10-1-30-188 as failed, suspect timeout reached
2016-01-04_16:22:11.15199     2016/01/04 16:22:11 [INFO] serf: EventMemberFailed: ip-10-1-30-188 10.1.30.188
2016-01-04_16:22:12.52031     2016/01/04 16:22:12 [INFO] memberlist: Marking ip-10-1-14-234 as failed, suspect timeout reached
2016-01-04_16:22:12.52032     2016/01/04 16:22:12 [INFO] serf: EventMemberFailed: ip-10-1-14-234 10.1.14.234
2016-01-04_16:22:13.18879     2016/01/04 16:22:13 [INFO] serf: EventMemberFailed: ip-10-1-46-44 10.1.46.44
2016-01-04_16:22:55.17294     2016/01/04 16:22:55 [INFO] serf: attempting reconnect to ip-10-1-41-212 10.1.41.212:8301
2016-01-04_16:24:05.17385     2016/01/04 16:24:05 [INFO] serf: attempting reconnect to ip-10-1-41-212 10.1.41.212:8301
2016-01-04_16:24:25.70454     2016/01/04 16:24:25 [INFO] serf: EventMemberJoin: ip-10-1-25-3 10.1.25.3
2016-01-04_16:24:25.88596     2016/01/04 16:24:25 [INFO] serf: EventMemberJoin: ip-10-1-44-13 10.1.44.13
2016-01-04_16:24:27.01968     2016/01/04 16:24:27 [INFO] serf: EventMemberJoin: ip-10-1-10-17 10.1.10.17
2016-01-04_16:24:45.17407     2016/01/04 16:24:45 [INFO] serf: attempting reconnect to ip-10-1-14-234 10.1.14.234:8301
2016-01-04_16:24:47.38073     2016/01/04 16:24:47 [INFO] consul.fsm: snapshot created in 24.815µs
2016-01-04_16:24:47.38082     2016/01/04 16:24:47 [INFO] raft: Starting snapshot up to 108285
2016-01-04_16:24:47.38087     2016/01/04 16:24:47 [INFO] snapshot: Creating new snapshot at /var/lib/consul/raft/snapshots/870-108285-1451924687380.tmp
2016-01-04_16:24:47.39008     2016/01/04 16:24:47 [INFO] snapshot: reaping snapshot /var/lib/consul/raft/snapshots/870-80020-1451924128158
2016-01-04_16:24:47.39028     2016/01/04 16:24:47 [INFO] raft: Compacting logs from 85240 to 98046
2016-01-04_16:24:47.43469     2016/01/04 16:24:47 [INFO] raft: Snapshot to 108285 complete
2016-01-04_16:25:19.54398     2016/01/04 16:25:19 [INFO] memberlist: Marking ip-10-1-30-187 as failed, suspect timeout reached
2016-01-04_16:25:19.54411     2016/01/04 16:25:19 [INFO] serf: EventMemberFailed: ip-10-1-30-187 10.1.30.187
2016-01-04_16:25:29.55614     2016/01/04 16:25:29 [INFO] memberlist: Marking ip-10-1-46-43 as failed, suspect timeout reached
2016-01-04_16:25:29.55626     2016/01/04 16:25:29 [INFO] serf: EventMemberFailed: ip-10-1-46-43 10.1.46.43
2016-01-04_16:25:41.13007     2016/01/04 16:25:41 [INFO] memberlist: Suspect ip-10-1-14-232 has failed, no acks received
2016-01-04_16:25:46.14224     2016/01/04 16:25:46 [INFO] serf: EventMemberFailed: ip-10-1-14-232 10.1.14.232
2016-01-04_16:27:49.91423     2016/01/04 16:27:49 [INFO] serf: EventMemberJoin: ip-10-1-44-56 10.1.44.56

^- That is possibly from when I did a rolling replacement of the instances.

Currently I only see the (expected) serf issues due to the non-gracefully removed instances:

2016-01-04_18:20:15.21293     2016/01/04 18:20:15 [INFO] serf: attempting reconnect to ip-10-1-46-44 10.1.46.44:8301
2016-01-04_18:21:55.21378     2016/01/04 18:21:55 [INFO] serf: attempting reconnect to ip-10-1-30-187 10.1.30.187:8301
2016-01-04_18:22:28.21203     2016/01/04 18:22:28 [INFO] serf: attempting reconnect to ip-10-1-10-181 10.1.10.181:8301

I can also provide the complete logs (well, those I still have, so only from the currently running instances) if necessary. I just need to spend some time scrubbing them.

slackpad commented 8 years ago

@discordianfish is it possible that you did a rolling restart of the servers without giving them time to rejoin and become peers again? That might have pushed your cluster into an outage state. If your config file has any use of -bootstrap you could end up in a split brain situation like this as well.

In any case, it looks like .157 is in a bad state where it has peers that are gone, and it hasn't added the other two good server nodes. I'd probably make that one leave and add a new server, or restart .157 with a clean data-dir, after making sure you are not using -bootstrap in your config.
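A possible recovery sequence for that server, assuming the other two servers stay healthy (the service manager calls are assumptions; the data-dir is the one from the command line above):

# On 10.1.9.157, or run `consul force-leave ip-10-1-9-157` from another node if it is unresponsive:
consul leave

# Stop the agent, wipe the stale state, and start again with no -bootstrap anywhere in the config:
sudo service consul stop       # how the agent is supervised here is an assumption
sudo rm -rf /var/lib/consul/*
sudo service consul start      # it rejoins via the -retry-join addresses and gets re-added as a peer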

discordianfish commented 8 years ago

@slackpad It should have waited for the server nodes to successfully join the cluster. Before continuing with the next instance, I run:

IP=$(ip addr show dev eth0|awk '/inet /{print $2}'|cut -d/ -f1)
while ! curl -s http://localhost:8500/v1/status/peers | grep -q $IP:; do echo Waiting for consul; sleep 1; done

That should make sure the node successfully joined. And as far as I understand, -bootstrap should be okay as long as the nodes I point it to are already bootstrapped. I see how it's simpler to reason about the state if -bootstrap is removed, yet that isn't trivial to automate.

slackpad commented 8 years ago

I'd need to dig deeper into Raft to confirm, but I think there are cases where the server might be the only peer in there. It might make your script above more robust to check both that the server's IP is in the peers list and that there are N entries in it before moving on.
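Something like this might capture both checks (a sketch; the expected count of 3 servers is an assumption):

IP=$(ip addr show dev eth0 | awk '/inet /{print $2}' | cut -d/ -f1)
EXPECTED=3
until peers=$(curl -sf http://localhost:8500/v1/status/peers) \
      && echo "$peers" | grep -q "\"$IP:" \
      && [ "$(echo "$peers" | tr ',' '\n' | wc -l)" -ge "$EXPECTED" ]; do
  echo "Waiting for consul"; sleep 1
done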

> That should make sure the node successfully joined. And as far as I understand, -bootstrap should be okay as long as the nodes I point it to are already bootstrapped.

This could still be dangerous for causing split-brains. It is better to use bootstrap-expect and set a retry-join list or similar.

This is very similar to https://github.com/hashicorp/consul/issues/1560, so linking these.

discordianfish commented 8 years ago

To clarify: I use bootstrap-expect 3 and retry-join.

phs commented 8 years ago

This is slightly off-topic, but since I came here looking for a robust way to wait for a consul cluster to self-assemble (in a context where I know it eventually will), perhaps others will be interested.

Right now I'm attempting to wait with a poll loop that explicitly asks for a lock: consul lock -n 64 wait-for-consul echo Consul is up. It seems that if this command succeeds (in the exit code sense) then I am good to go. @slackpad How would you rate this tactic vs. polling the peers list?

phs commented 8 years ago

...following that train of thought, is there an argument against consul supporting something like consul watch -type consul -event healthy echo Consul is up?

slackpad commented 8 years ago

@phs I think your use of consul lock looks like it would work well - that will only pass if a write is able to get through (actually two writes, since it has to create a session as well), so there needs to be a quorum of servers up and running to accomplish that. This should be more reliable than polling the peers list.

It's a good suggestion to make a first-class "is consul up" command - we can keep track of that here.
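For reference, wrapped in a retry loop, the check above could look like this (a sketch reusing the same semaphore prefix; `true` stands in for whatever "Consul is up" action you want):

until consul lock -n 64 wait-for-consul true; do
  echo "Waiting for consul"; sleep 1
done
echo "Consul is up"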

amochtar commented 8 years ago

@slackpad We have a similar situation, where the /v1/status/peers endpoint shows (besides all current masters) a peer that was deleted some time ago. consul members doesn't show that node. The output is the same for all three master nodes.

The outage recovery document (https://www.consul.io/docs/guides/outage.html) says I can fix that by stopping all masters, editing the peers.json file and starting the servers again. I was wondering if there is also a way to fix this while keeping the cluster alive?

I'd like to avoid downtime if I can. And except for some Raft messages saying the peer cannot be found for voting (Failed to make RequestVote RPC), the cluster seems to be operating fine otherwise...
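For reference, the peers.json mentioned in that guide is, on this generation of Consul, just a JSON array of the servers' Raft RPC addresses (the path is relative to the data-dir; the addresses below are placeholders):

cat /var/lib/consul/raft/peers.json
["10.0.0.1:8300","10.0.0.2:8300","10.0.0.3:8300"]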

slackpad commented 8 years ago

@amochtar unfortunately there's currently no way to force a peer out if it's no longer in the cluster's member list, so stopping the servers and updating the peers list is the way to fix it. The danger of leaving it around is that it increases your quorum size. For example, if you have 3 good + 1 zombie server, then your quorum size is 3, so losing one of your good servers could cause an outage. If you remove that zombie server, your quorum size drops to 2 and you will be able to handle the outage of a server as expected.
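Those numbers come from the usual Raft rule, quorum = floor(N/2) + 1, counted over everything in the Raft peer list, zombies included; a quick way to see it:

for n in 3 4 5; do echo "peers=$n quorum=$(( n / 2 + 1 ))"; done
# peers=3 quorum=2
# peers=4 quorum=3
# peers=5 quorum=3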

amochtar commented 8 years ago

That's too bad... And what would happen if I start a new peer on the same IP and have it join the cluster, wait for it to sync and then gracefully leave the cluster?

slackpad commented 8 years ago

@amochtar that should work if you can give it the same IP. You can use consul leave to make sure it's gone from the cluster before you retire it.
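A sketch of that sequence (addresses and paths here are placeholders, not a real setup):

# 1. On the replacement host (same IP as the stale peer), start a server agent and let it join:
consul agent -server -data-dir /var/lib/consul -retry-join 10.0.0.1

# 2. Once it has caught up, from that same host, leave gracefully so the stale peer entry is removed:
consul leave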

amochtar commented 8 years ago

@slackpad that worked :) created a new node with the old IP address, started a new consul agent, joined the existing cluster, then left again and it nicely cleaned the peers list 👍

lswith commented 7 years ago

What is the purpose of this API?

I thought it would be indicative of the current peers, not the configured peers.

slackpad commented 7 years ago

Hi @lswith, it does show the current peers; the Raft library calls that the "configuration", which doesn't have anything to do with configuration files.

lswith commented 7 years ago

It is interesting, though, because the info command shows a different peer count. I thought this API would expose that instead?
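For anyone comparing them, these are the places the numbers come from (the operator subcommand only exists on newer Consul versions, and my understanding is that num_peers in consul info counts the other servers rather than the local one):

curl -s http://localhost:8500/v1/status/peers   # the full Raft peer list
consul operator raft list-peers                 # same information via the CLI on Consul 0.7+
consul info | grep num_peers                    # the count reported by the info command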

slackpad commented 7 years ago

Consul 0.8 added https://www.consul.io/docs/guides/autopilot.html which will automatically clean up dead servers to keep things in sync.
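On 0.8+ that cleanup behaviour can be inspected and toggled through the operator CLI (a sketch):

consul operator autopilot get-config
consul operator autopilot set-config -cleanup-dead-servers=true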