Open ryanslade opened 8 years ago
@ryanslade you might want to update your code URLs to point to specific commits and not master. Your links may break if anyone changes those files.
Not sure if you know this trick, but if you press the y key while viewing source code on GitHub it'll change your URL to the specific commit. https://github.com/hashicorp/consul/blob/master/api/lock.go#L132 became https://github.com/hashicorp/consul/blob/d5b7530ec593f1ec2a8f8a7c145bcadafa88b572/api/lock.go#L132.
@theckman Thanks for the tip, done.
Hi @ryanslade thanks for opening an issue. We should definitely document this and I'm not totally sure if we should change the behavior. It seems like we should stop updating the session whenever the leader channel closes, but I'd need to think through if there are any other implications if we change that.
Yes. I am also hitting this issue.
docker version
Client:
Version: 1.10.3
API version: 1.22
Go version: go1.5.3
Git commit: 20f81dd
Built: Thu Mar 10 15:54:52 2016
OS/Arch: linux/amd64
Server:
Version: 1.10.3
API version: 1.22
Go version: go1.5.3
Git commit: 20f81dd
Built: Thu Mar 10 15:54:52 2016
OS/Arch: linux/amd64
I used the commands below to bring up my cluster master node:
sudo docker run -d --restart=unless-stopped -p 8500:8500 --name=consul progrium/consul -server -bootstrap
sudo docker run -d --restart=unless-stopped -p 4000:4000 swarm manage -H :4000 --replication --advertise ${aws_instance.swarm_master.0.private_ip}:4000 consul://${aws_instance.swarm_master.0.private_ip}:8500
I rebooted the machine hosting the swarm cluster master. I can see that the swarm process is running in Docker, but it is not connectable.
root@XXXXXXX:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
32c227fbab03 swarm "/swarm join --advert" 13 hours ago Up 13 hours 2375/tcp romantic_joliot
f8427a36e1f4 swarm "/swarm manage -H :40" 13 hours ago Up 13 hours 2375/tcp, 0.0.0.0:4000->4000/tcp backstabbing_hoover
44df7d59752d progrium/consul "/bin/start -server -" 13 hours ago Up 13 hours 53/tcp, 53/udp, 8300-8302/tcp, 8400/tcp, 8301-8302/udp, 0.0.0.0:8500->8500/tcp consul
I can't list members:
root@XXXXXXXX:~# docker run swarm list consul://$IP:8500
time="2016-10-09T06:46:17Z" level=info msg="Initializing discovery without TLS"
2016/10/09 06:46:17 Unexpected response code: 500
Then I checked the docker logs for the swarm container:
root@XXXXXXXX:~# docker logs f8427a36e1f4
time="2016-10-09T06:35:34Z" level=info msg="Leader Election: Cluster leadership lost"
time="2016-10-09T06:35:34Z" level=error msg="Unexpected response code: 500 (No cluster leader)"
time="2016-10-09T06:35:41Z" level=error msg="Discovery error: Unexpected response code: 500"
time="2016-10-09T06:35:41Z" level=error msg="Discovery error: Unexpected response code: 500 (No cluster leader)"
time="2016-10-09T06:35:41Z" level=error msg="Discovery error: Unexpected watch error"
I checked with the Docker team. They reported that it's a Consul issue. Can you please fix this issue?
Thanks, Shankar KC
Seeing this too on Kubernetes (Minikube)
consul version for both Client and Server
Client: Go API Client, HEAD
Server: 0.7.0
Operating system and Environment details
Ubuntu Linux 14.04
Description of the Issue (and unexpected/desired result)
We are using client side leader election (https://github.com/hashicorp/consul/blob/d5b7530ec593f1ec2a8f8a7c145bcadafa88b572/api/lock.go#L132)
When leadership is acquired, two goroutines are created: https://github.com/hashicorp/consul/blob/d5b7530ec593f1ec2a8f8a7c145bcadafa88b572/api/lock.go#L151 https://github.com/hashicorp/consul/blob/d5b7530ec593f1ec2a8f8a7c145bcadafa88b572/api/lock.go#L239
When the current Consul server leader is restarted, the monitor goroutine fails, I assume because it is querying in consistent mode. This signals to the client that leadership was revoked. I would expect the client to try to acquire leadership again, since this is a blocking process. However, the renew goroutine started earlier continues to renew the old session, which means the lock appears to still be held.
Our current workaround is to call Unlock() when our leadership is revoked but this wasn't obvious.
Is this behaviour expected? If so, it should be documented.
If not, I propose that the sessionRenew channel on the lock should be closed when leadership is revoked.
Reproduction steps
Launch a 3 node cluster (we used docker). Client code should attempt to acquire leadership and then wait for it to be revoked. Restart the Consul server (docker restart works well). Once revoked, the client code should loop around and try to acquire leadership again. Instead, it waits forever and the lock is never re-acquired.
If the Consul servers are run with DEBUG logging, you'll see that both the old and the new lock sessions are being renewed.