hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Server restart leads to client leader election issues #2371

Open ryanslade opened 8 years ago

ryanslade commented 8 years ago

consul version for both Client and Server

Client: Go API client (HEAD)
Server: 0.7.0

Operating system and Environment details

Ubuntu Linux 14.04

Description of the Issue (and unexpected/desired result)

We are using client side leader election (https://github.com/hashicorp/consul/blob/d5b7530ec593f1ec2a8f8a7c145bcadafa88b572/api/lock.go#L132)

When leadership is acquired, two goroutines are created: https://github.com/hashicorp/consul/blob/d5b7530ec593f1ec2a8f8a7c145bcadafa88b572/api/lock.go#L151 https://github.com/hashicorp/consul/blob/d5b7530ec593f1ec2a8f8a7c145bcadafa88b572/api/lock.go#L239

When the current Consul server leader is restarted, the monitor goroutine fails, I assume because it is querying in consistent mode. This signals to the client that leadership was revoked. I would expect the client to then try to acquire leadership again, since this is a blocking process. However, the renew goroutine started earlier continues to renew the old session, which means the lock still appears to be held.

Our current workaround is to call Unlock() when our leadership is revoked, but this wasn't obvious.
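For context, the workaround looks roughly like this with the Go API client. This is a hedged sketch, not our exact code: the key name `service/app/leader` is an arbitrary example, and error handling is reduced to logging.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	for {
		lock, err := client.LockKey("service/app/leader")
		if err != nil {
			log.Fatal(err)
		}
		leaderCh, err := lock.Lock(nil) // blocks until leadership is acquired
		if err != nil {
			log.Fatal(err)
		}
		log.Println("became leader")

		<-leaderCh // closed when the monitor believes leadership was revoked
		log.Println("leadership revoked")

		// Workaround: without this, the renew goroutine keeps the old
		// session alive and the next Lock() call blocks forever.
		if err := lock.Unlock(); err != nil {
			log.Println("unlock:", err)
		}
	}
}
```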

Is this behaviour expected? If so, it should be documented.

If not, I propose that the sessionRenew channel on the lock should be closed when leadership is revoked.

Reproduction steps

1. Launch a 3-node cluster (we used Docker).
2. The client code should attempt to acquire leadership and then wait for it to be revoked.
3. Restart the Consul server (docker restart works well).
4. Once leadership is revoked, the client code should loop around and try to acquire leadership again. Instead, it waits forever and the lock is never acquired.

If the Consul servers are run with DEBUG logging, you'll see that both the old and the new lock sessions are being renewed.

theckman commented 8 years ago

@ryanslade you might want to update your code URLs to point to specific commits and not master. Your links may break if anyone changes those files.

Not sure if you know this trick, but if you press your y key while viewing source code in GitHub it'll change your URL to the specific commit. https://github.com/hashicorp/consul/blob/master/api/lock.go#L132 became https://github.com/hashicorp/consul/blob/d5b7530ec593f1ec2a8f8a7c145bcadafa88b572/api/lock.go#L132.

ryanslade commented 8 years ago

@theckman Thanks for the tip, done.

slackpad commented 8 years ago

Hi @ryanslade thanks for opening an issue. We should definitely document this and I'm not totally sure if we should change the behavior. It seems like we should stop updating the session whenever the leader channel closes, but I'd need to think through if there are any other implications if we change that.

shankarkc commented 8 years ago

Yes. I am also hitting this issue.

 docker version
Client:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:54:52 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:54:52 2016
 OS/Arch:      linux/amd64

I used the commands below to bring up my cluster master node:

 sudo docker run -d --restart=unless-stopped -p 8500:8500 --name=consul progrium/consul -server -bootstrap
 sudo docker run -d --restart=unless-stopped -p 4000:4000 swarm manage -H :4000 --replication --advertise ${aws_instance.swarm_master.0.private_ip}:4000 consul://${aws_instance.swarm_master.0.private_ip}:8500

I rebooted the machine hosting the Swarm cluster master. I see that the Swarm process is running in Docker, but it is not connectable.

root@XXXXXXX:~# docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                                            NAMES
32c227fbab03        swarm               "/swarm join --advert"   13 hours ago        Up 13 hours         2375/tcp                                                                         romantic_joliot
f8427a36e1f4        swarm               "/swarm manage -H :40"   13 hours ago        Up 13 hours         2375/tcp, 0.0.0.0:4000->4000/tcp                                                 backstabbing_hoover
44df7d59752d        progrium/consul     "/bin/start -server -"   13 hours ago        Up 13 hours         53/tcp, 53/udp, 8300-8302/tcp, 8400/tcp, 8301-8302/udp, 0.0.0.0:8500->8500/tcp   consul

I can't list members:

root@XXXXXXXX:~#  docker  run  swarm list consul://$IP:8500
time="2016-10-09T06:46:17Z" level=info msg="Initializing discovery without TLS"
2016/10/09 06:46:17 Unexpected response code: 500

Then I looked at the Docker logs for the Swarm container:

root@XXXXXXXX:~# docker logs f8427a36e1f4
time="2016-10-09T06:35:34Z" level=info msg="Leader Election: Cluster leadership    lost"
time="2016-10-09T06:35:34Z" level=error msg="Unexpected response code: 500 (No cluster leader)"
time="2016-10-09T06:35:41Z" level=error msg="Discovery error: Unexpected response code: 500"
time="2016-10-09T06:35:41Z" level=error msg="Discovery error: Unexpected response code: 500 (No cluster leader)"
time="2016-10-09T06:35:41Z" level=error msg="Discovery error: Unexpected watch error"

I checked with the Docker team. They reported that it's a Consul issue. Can you please fix this issue?

Thanks, Shankar KC

vidhill commented 7 years ago

Seeing this too on Kubernetes (Minikube)