hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul servers won't elect a leader #993

Closed · eirslett closed this issue 7 years ago

eirslett commented 9 years ago

I have 3 Consul servers running (plus a handful of other nodes), and they can all speak to each other - or so I think; at least they're sending UDP messages between themselves. The logs still show [ERR] agent: failed to sync remote state: No cluster leader, so even though the servers know about each other, it looks like they fail to perform an actual leader election... Is there a way to trigger a leader election manually?

I'm running consul 0.5.2 on all nodes.

markhu commented 8 years ago

Consul 0.7.0 provides a new diagnostic command:

consul operator raft --list-peers --stale=true
Operator "raft" subcommand failed: Unexpected response code: 500 (No cluster leader)
cnoffsin commented 8 years ago

@markhu

That's cool, but the core issue is that we want Consul to elect a cluster leader in the event of a graceful or "un-graceful" restart of a server.

shankarkc commented 8 years ago

Yes. I am also hitting this issue.

 docker version
Client:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:54:52 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:54:52 2016
 OS/Arch:      linux/amd64

I used the commands below to bring up my cluster master node:

 "sudo docker run -d  --restart=unless-stopped -p 8500:8500 --name=consul progrium/consul -server -bootstrap"
 "sudo docker run -d  --restart=unless-stopped -p 4000:4000 swarm manage -H :4000 --replication --advertise ${aws_instance.swarm_master.0.private_ip}:4000 consul://${aws_instance.swarm_master.0.private_ip}:8500"

I rebooted the machine hosting the Swarm cluster master. I can see that the Swarm process is running in Docker, but it is not connectable.

root@XXXXXXX:~# docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                                            NAMES
32c227fbab03        swarm               "/swarm join --advert"   13 hours ago        Up 13 hours         2375/tcp                                                                         romantic_joliot
f8427a36e1f4        swarm               "/swarm manage -H :40"   13 hours ago        Up 13 hours         2375/tcp, 0.0.0.0:4000->4000/tcp                                                 backstabbing_hoover
44df7d59752d        progrium/consul     "/bin/start -server -"   13 hours ago        Up 13 hours         53/tcp, 53/udp, 8300-8302/tcp, 8400/tcp, 8301-8302/udp, 0.0.0.0:8500->8500/tcp   consul

I can't list members:

root@XXXXXXXX:~#  docker  run  swarm list consul://$IP:8500
time="2016-10-09T06:46:17Z" level=info msg="Initializing discovery without TLS"
2016/10/09 06:46:17 Unexpected response code: 500

Then I looked at the Docker logs for the Swarm container:

root@XXXXXXXX:~# docker logs f8427a36e1f4
time="2016-10-09T06:35:34Z" level=info msg="Leader Election: Cluster leadership    lost"
time="2016-10-09T06:35:34Z" level=error msg="Unexpected response code: 500 (No cluster leader)"
time="2016-10-09T06:35:41Z" level=error msg="Discovery error: Unexpected response code: 500"
time="2016-10-09T06:35:41Z" level=error msg="Discovery error: Unexpected response code: 500 (No cluster leader)"
time="2016-10-09T06:35:41Z" level=error msg="Discovery error: Unexpected watch error"

Is there a workaround for this issue? If I restart my VM for some reason, all containers hosted on this Swarm can't run because the Swarm is down.

Thanks, Shankar KC

flypenguin commented 7 years ago

I also just hit this same issue, which caused downtime in our services. We are also using bootstrap_expect=3. What seems to have been the cause here is that a server was switched off overnight, so the cluster was degraded. I tried adding more server nodes, but that did not help at all (they all reported "no cluster leader"). Restarting the failed node did it.

The original cluster had 3 nodes and degraded to 2, which should be just fine (and was before). Now the cluster has 5 nodes, which will degrade to 4 tonight, and we shall see.

btw, the "new debug command" also did everything but work for me - first the "no cluster leader" error, now this: Operator "raft" subcommand failed: Unexpected response code: 500 (rpc error: rpc: can't find service Operator.RaftGetConfiguration) (consul v0.7.1)

slackpad commented 7 years ago

btw, the "new debug command" also did everything but work for me - first the "no cluster leader" error, now this: Operator "raft" subcommand failed: Unexpected response code: 500 (rpc error: rpc: can't find service Operator.RaftGetConfiguration) (consul v0.7.1)

Are you running Consul 0.7.0+ on all your servers or is it possible you have a mix of versions?

flypenguin commented 7 years ago

All Consul masters are 0.7.1; by now probably all Consul agents are. This morning it was only the masters.

slackpad commented 7 years ago

Ok - that RPC error looks like the command may have talked to an old server. I'll need to look at the -stale issue, as that seems to have been reported by you and another person.

If a single server of a 3-server cluster going down causes an outage, that's likely due to a stale peer in the Raft configuration. You can use the consul force-leave <node name> command to kick it if it has recently been removed (but still shows in consul members), or consul operator raft -remove-peer -address="<ip:port>" if it's stale and no longer known in consul members. consul operator raft -list-peers should let you inspect the configuration to see if this is the case.
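
Roughly, the recovery sequence looks like this (the node name consul-server-3 and the address 10.0.0.3:8300 are placeholders):

  # Inspect the Raft configuration to spot the stale peer (requires 0.7.0+).
  consul operator raft -list-peers

  # If the dead server still shows up in consul members, force it out of the gossip pool.
  consul force-leave consul-server-3

  # If it no longer shows in consul members but is still listed as a Raft peer, remove it by address.
  consul operator raft -remove-peer -address="10.0.0.3:8300"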

haf commented 7 years ago

Our staging environment went down in a similar fashion; here's my write-up https://gist.github.com/haf/1983206cf11846f6f3f291f78acee5cf

rhyas commented 7 years ago

Raising my hand as another one hitting this issue. We had the same thing: the current leader's AWS node died, a new one was spun up, and nothing converged. We ran into the same problem trying to fix it manually, with the "no leader" error making it difficult to find out which Raft node is dead and remove it. We also tried the peers.json recovery, and that failed because the server wouldn't even start with that file present in the documented format. :( Our ultimate solution was to blow away all 3 nodes and let the cluster bootstrap from scratch. This left it disconnected from all the agents, but doing a join to the agents, which were all still part of the old cluster, brought everything back into sync (services anyway; we didn't check KV data). Our cluster is all 0.7.2+. We're still in test mode, so there was no production impact, just some slowed development cycles and a yellow flag on the Consul rollout.

This is very easy to reproduce. Set up a new 3-node cluster with bootstrap_expect=3, wait until it has converged with a leader, then kill off the leader (terminate the instance). The cluster will never recover.
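
A rough sketch of the repro, assuming three servers named s1, s2, and s3 (hypothetical hostnames) running the agent in the foreground:

  # On each of s1, s2, s3:
  consul agent -server -bootstrap-expect=3 -data-dir=/tmp/consul \
    -bind=<this_host_ip> -retry-join=s1 -retry-join=s2 -retry-join=s3

  # Once the logs report leadership, find the current leader...
  consul operator raft -list-peers

  # ...and terminate that instance. Without removing the dead peer manually
  # (or Autopilot in 0.8+), the survivors keep logging "No cluster leader".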

dcrystalj commented 7 years ago

Isn't this the most basic feature consul should support? Unbelievable it's still not working. Any workarounds?

slackpad commented 7 years ago

We've got automation coming in Consul 0.8 that'll fix this - https://github.com/hashicorp/consul/blob/master/website/source/docs/guides/autopilot.html.markdown.
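
Once the servers are on 0.8+, something along these lines shows and enables the dead-server cleanup (it is on by default):

  # Show the current Autopilot settings, including CleanupDeadServers.
  consul operator autopilot get-config

  # Explicitly turn on automatic removal of dead servers from the Raft peer set.
  consul operator autopilot set-config -cleanup-dead-servers=true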

flypenguin commented 7 years ago

That is so good to hear :). Our workaround is to scratch the Consul data dirs on EVERY master host and re-run Puppet, which then sets Consul up again. Our setup automation can handle that pretty well; without it we'd have been lost a couple of times.
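
Roughly, on every master it amounts to something like this (the data dir, service name, and Puppet invocation are from our setup, so adjust as needed; note that wiping every server's data dir also discards the Raft-replicated KV store):

  # Stop the agent and wipe its local Raft/Serf state.
  sudo systemctl stop consul
  sudo rm -rf /var/consul/*

  # Re-apply configuration management, which reinstalls the config and restarts Consul.
  sudo puppet agent -t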

rsrini83 commented 7 years ago

Hi, we are also facing this issue (no leader elected after a system restart). However, our Consul instances run in Docker containers on multiple EC2 instances. Can anyone suggest a simple workaround for the dockerized case?

slackpad commented 7 years ago

Closing this out now that Autopilot is available in 0.8.x - https://www.consul.io/docs/guides/autopilot.html.


slackpad commented 7 years ago

We also (in 0.7.x) made this change:

Servers will now abort bootstrapping if they detect an existing cluster with configured Raft peers. This will help prevent safe but spurious leader elections when introducing new nodes with bootstrap_expect enabled into an existing cluster. [GH-2319]

edbergavera commented 7 years ago

@slackpad In our situation, we have a 3-member Consul cluster deployed on Kubernetes. Each member is in its own pod. We recently made changes to our cluster and did a rolling update. After that, the 3 Consul servers show as running fine in Kubernetes, but the logs on each member say no cluster leader. I am able to list all members with consul members (please see below):

Node             Address          Status  Type    Build  Protocol  DC
consul-consul-0  100.96.2.7:8301  alive   server  0.7.5  2         dc1
consul-consul-1  100.96.1.3:8301  alive   server  0.7.5  2         dc1
consul-consul-2  100.96.3.6:8301  alive   server  0.7.5  2         dc1

Should I try the peers.json file?

slackpad commented 7 years ago

Hi @edbergavera if the servers are trying to elect a leader and there are dead servers in the quorum from the rolling update that's preventing it, then you would need to use peers.json per https://www.consul.io/docs/guides/outage.html#manual-recovery-using-peers-json.
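
For Raft protocol 2 (the 0.7.x default), peers.json is just a JSON array of the servers' Raft addresses. A sketch using the pod IPs above, assuming data_dir is /var/consul/data (adjust the path to your actual data_dir); note it uses the server RPC port 8300, not the 8301 Serf port shown by consul members:

  # With all three servers stopped, create this file on each of them:
  #   /var/consul/data/raft/peers.json
  ["100.96.2.7:8300", "100.96.1.3:8300", "100.96.3.6:8300"]

  # Then restart the servers; they should elect a leader from exactly this peer set.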

edbergavera commented 7 years ago

Hello James,

I did follow the instructions described in the outage document, but to no avail. I think this is specific to a Kubernetes pod issue with Consul. So I ended up re-creating the cluster in Kubernetes and restoring the KVs, and that worked.

Thank you for your suggestion and looking into this.


eladitzhakian commented 7 years ago

Having this exact same issue with 0.8.1. A new leader is elected, then leadership is lost and the election restarts. I was able to recover using peers.json, praise the lord.

kyrelos commented 7 years ago

Experienced this issue when I activated raft_protocol version 3; reverting to raft_protocol version 2 fixed it. Still investigating why the switch to v3 triggered the issue.

dgulinobw commented 7 years ago

A cluster of 5 running 0.9.0 will not elect a leader with raft_protocol = 3, but will elect one with raft_protocol = 2.

Working config (consul.json):

{
  "bootstrap_expect": 5,
  "retry_join": ["a.a.a.a", "b.b.b.b", "c.c.c.c", "d.d.d.d", "e.e.e.e"],
  "server": true,
  "rejoin_after_leave": true,
  "enable_syslog": true,
  "data_dir": "/var/consul/data",
  "datacenter": "us-east-1",
  "recursor": "10.0.0.2",
  "advertise_addrs": {
    "serf_lan": "a.a.a.a:8301",
    "serf_wan": "a.a.a.a:8302",
    "rpc": "a.a.a.a:8300"
  },
  "bind_addr": "z.z.z.z",
  "encrypt": "NANANANA",
  "ui": true,
  "encrypt_verify_incoming": true,
  "encrypt_verify_outgoing": true,
  "key_file": "/var/consul/data/pki/private/server.key",
  "cert_file": "/var/consul/data/pki/certs/server.crt",
  "ca_file": "/var/consul/data/pki/certs/ca.crt",
  "raft_protocol": 2,
  "protocol": 3
}

slackpad commented 7 years ago

Hi @dgulinobw can you please open a new issue and include a gist with the server logs when you see this? Thanks!

spuder commented 4 years ago

I also ran into this on a new consul cluster running 1.6.0

As soon as I made sure all the Consul servers had both a default token and an agent token, the cluster was able to elect a leader. Having only a default token or only an agent token was insufficient.

token=11111111111                          # ACL token to assign to the agent (placeholder value)
export CONSUL_HTTP_TOKEN=00000000000000    # token used to authenticate the CLI calls below
consul acl set-agent-token default $token  # token the agent uses when no other token is supplied
consul acl set-agent-token agent $token    # token the agent uses for its own internal operations
cat /opt/consul/acl-tokens.json            # confirm both tokens were persisted

kawsark commented 4 years ago

@spuder Interesting. What was your token policy for $token that you set for both default and agent?

ckvtvm commented 2 years ago

After a day of testing, this almost works. It starts with bootstrap-expect=1 and elects itself leader. The others join, and I have my cluster back. Unfortunately, I am running into a case where it decides to give up as leader: for some reason it detects long-dead peers as active and wants to run an election, which it cannot win because, well... the peers are really dead. Is this a bug, or is there some reason for that?

http://pastebin.com/NR5RSvDq

You saved my day!