kubernetes-retired / contrib

[EOL] This is a place for various components in the Kubernetes ecosystem that aren't part of the Kubernetes core.
Apache License 2.0

Leader Elector HTTP server mismatch #2933

Closed fredrik-jansson-se closed 5 years ago

fredrik-jansson-se commented 6 years ago

I have three pods that I distribute across three nodes using anti-affinity rules.

To test the leader election, I cordon the node running the current leader and delete the leader pod (see the transcript below).

According to the logs, the two remaining pods agree on a new leader.

Problem: one pod's leader-elector HTTP server still returns the old leader.

I can consistently reproduce this on my cluster.

Kubernetes is v1.11.1, leader-elector is v0.5 (I also tested v0.4 with the same results).

Given the widespread use of the leader-elector, I assume I'm doing something wrong... but I really cannot figure out what.
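For reference, each pod's elector sidecar answers a plain HTTP GET with the leader's name as JSON, as the `curl` calls in the transcript show. A small shell sketch that polls every pod and flags disagreement; the container name `elector`, the port, and the helper names are assumptions:

```shell
# Extract the leader name from the elector's JSON reply,
# e.g. {"name":"nso-0"} -> nso-0 (pure parameter expansion, no jq needed).
leader_name() {
  local json="$1"
  json="${json#*\"name\":\"}"    # drop text up to the value
  printf '%s\n' "${json%%\"*}"   # drop the closing quote and the rest
}

# Query each pod's sidecar and report any disagreement (hypothetical
# helper; assumes kubectl access and an elector listening on :4040).
check_leaders() {
  local ref="" cur pod
  for pod in "$@"; do
    cur=$(leader_name "$(kubectl exec "$pod" -c elector -- curl -s http://localhost:4040)")
    echo "$pod reports leader: $cur"
    [ -n "$ref" ] || ref="$cur"
    [ "$cur" = "$ref" ] || echo "MISMATCH: $pod disagrees ($cur vs $ref)"
  done
}

# check_leaders nso-0 nso-1 nso-2
```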

frjansso@kube-1:/mnt/kube-ha/nso-ha-test$ kubectl get pod
NAME      READY     STATUS    RESTARTS   AGE
nso-0     2/2       Running   0          37m
nso-1     2/2       Running   0          37m
nso-2     2/2       Running   0          37m
frjansso@kube-1:/mnt/kube-ha/nso-ha-test$ kubectl exec -it nso-0 bash
Defaulting container name to nso-master.
Use 'kubectl describe pod/nso-0 -n default' to see all of the containers in this pod.
root@nso-0:/# curl http://localhost:4040
{"name":"nso-0"}
root@nso-0:/# exit
exit
frjansso@kube-1:/mnt/kube-ha/nso-ha-test$ kubectl exec -it nso-2 bash
Defaulting container name to nso-master.
Use 'kubectl describe pod/nso-2 -n default' to see all of the containers in this pod.
root@nso-2:/# curl http://localhost:4040
{"name":"nso-0"}
root@nso-2:/#
root@nso-2:/# exit
command terminated with exit code 127
frjansso@kube-1:/mnt/kube-ha/nso-ha-test$ kubectl get pod -o=wide
NAME      READY     STATUS    RESTARTS   AGE       IP                NODE
nso-0     2/2       Running   0          40m       192.168.89.225    kube-3
nso-1     2/2       Running   0          40m       192.168.79.237    kube-2
nso-2     2/2       Running   0          40m       192.168.126.121   kube-1
frjansso@kube-1:/mnt/kube-ha/nso-ha-test$ kubectl cordon kube-3
node/kube-3 cordoned
frjansso@kube-1:/mnt/kube-ha/nso-ha-test$ kubectl delete pod nso-0
pod "nso-0" deleted

frjansso@kube-1:/mnt/kube-ha/nso-ha-test$ kubectl logs nso-1 elector
....
I0806 16:45:05.408266       8 leaderelection.go:296] lock is held by nso-0 and has not yet expired
I0806 16:45:09.796249       8 leaderelection.go:296] lock is held by nso-0 and has not yet expired
**nso-1 is the leader**
I0806 16:45:14.161264       8 leaderelection.go:215] sucessfully acquired lease default/nso-svc
frjansso@kube-1:/mnt/kube-ha/nso-ha-test$ kubectl logs nso-2 elector
....
I0806 16:45:12.120216       7 leaderelection.go:296] lock is held by nso-0 and has not yet expired
I0806 16:45:16.479642       7 leaderelection.go:296] lock is held by nso-1 and has not yet expired
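The HTTP replies can lag, but the authoritative lease lives in the Kubernetes API: the elector of that era recorded it as a JSON annotation (key `control-plane.alpha.kubernetes.io/leader`, with a `holderIdentity` field) on the object named by the election (`default/nso-svc` in the logs above). A hedged sketch for pulling the holder out of that record; the annotation key and field name are what client-go used at the time, so verify against your cluster:

```shell
# Extract holderIdentity from a LeaderElectionRecord JSON blob, e.g.
# {"holderIdentity":"nso-1","leaseDurationSeconds":10,...} -> nso-1
lease_holder() {
  local json="$1"
  json="${json#*\"holderIdentity\":\"}"
  printf '%s\n' "${json%%\"*}"
}

# Usage against a live cluster (assumes kubectl access):
# lease_holder "$(kubectl get endpoints nso-svc -o \
#   jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}')"
```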

frjansso@kube-1:/mnt/kube-ha/nso-ha-test$ kubectl exec -it nso-1 bash
Defaulting container name to nso-master.
Use 'kubectl describe pod/nso-1 -n default' to see all of the containers in this pod.
root@nso-1:/# curl http://localhost:4040
{"name":"nso-1"}
root@nso-1:/# exit
exit
frjansso@kube-1:/mnt/kube-ha/nso-ha-test$ kubectl exec -it nso-2 bash
Defaulting container name to nso-master.
Use 'kubectl describe pod/nso-2 -n default' to see all of the containers in this pod.
root@nso-2:/# curl http://localhost:4040
{"name":"nso-0"}
root@nso-2:/#
fredrik-jansson-se commented 6 years ago

Duplicate of https://github.com/kubernetes/contrib/issues/2930

fredrik-jansson-se commented 6 years ago

I pulled the latest contrib code and rebuilt the container manually; I can no longer reproduce the issue, so my guess is that this has been fixed but no new container was pushed.

Available here: https://hub.docker.com/r/fredrikjanssonse/leader-elector/tags/
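For anyone who wants to repeat that rebuild, a rough sequence wrapped in a function so nothing runs by accident; the `election` subdirectory, the build entry point, and the image tag are all assumptions, so adjust them to the repo's actual layout and your own registry:

```shell
# Hypothetical rebuild-and-push sequence for the elector image.
rebuild_elector() {
  image="${1:-example.registry/leader-elector:rebuilt}"   # hypothetical tag
  git clone https://github.com/kubernetes/contrib.git &&
  cd contrib/election &&
  docker build -t "$image" . &&
  docker push "$image"
}
```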

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/contrib/issues/2933#issuecomment-451336351):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.