kelseyhightower / consul-on-kubernetes

Running HashiCorp's Consul on Kubernetes
Apache License 2.0

Testing failing nodes does not restore the cluster.... #6

Open ajohnstone opened 7 years ago

ajohnstone commented 7 years ago

Testing failing nodes does not restore the cluster....

$ kubectl delete pods consul-2 consul-1;

HTTP error code from Consul: 500 Internal Server Error

This is an error page for the Consul web UI. You may have visited a URL that is loading an unknown resource, so you can try going back to the root.

Otherwise, please report any unexpected issues on the GitHub page.
$ kubectl exec --tty -i consul-0 -- consul members
Node      Address           Status  Type    Build  Protocol  DC
consul-0  100.96.4.13:8301  alive   server  0.7.2  2         dc1
consul-1  100.96.7.6:8301   alive   server  0.7.2  2         dc1
consul-2  100.96.6.12:8301  alive   server  0.7.2  2         dc1

$ kubectl get pods -o wide
NAME           READY     STATUS    RESTARTS   AGE       IP            NODE
consul-0       1/1       Running   0          7h        100.96.4.13   ip-10-117-89-126.eu-west-1.compute.internal
consul-1       1/1       Running   0          8m        100.96.7.6    ip-10-117-97-131.eu-west-1.compute.internal
consul-2       1/1       Running   0          8h        100.96.6.12   ip-10-117-37-128.eu-west-1.compute.internal
docker-debug   1/1       Running   0          10h       100.96.6.2    ip-10-117-37-128.eu-west-1.compute.internal

$ kubectl  exec --tty -i consul-0 -- consul operator raft -list-peers
Operator "raft" subcommand failed: Unexpected response code: 500 (No cluster leader)

$ kubectl  exec --tty -i consul-0 -- consul members
Node      Address           Status  Type    Build  Protocol  DC
consul-0  100.96.4.13:8301  alive   server  0.7.2  2         dc1
consul-1  100.96.7.6:8301   alive   server  0.7.2  2         dc1
consul-2  100.96.6.12:8301  alive   server  0.7.2  2         dc1

$ kubectl  exec --tty -i consul-0 -- consul monitor
...
2017/01/20 10:50:59 [WARN] raft: Election timeout reached, restarting election
2017/01/20 10:50:59 [INFO] raft: Node at 100.96.4.13:8300 [Candidate] entering Candidate state in term 4324
2017/01/20 10:50:59 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.10:8300 100.96.6.10:8300}: dial tcp 100.96.6.10:8300: getsockopt: no route to host
2017/01/20 10:50:59 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.11:8300 100.96.6.11:8300}: dial tcp 100.96.6.11:8300: getsockopt: no route to host
2017/01/20 10:50:59 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.4.7:8300 100.96.4.7:8300}: dial tcp 100.96.4.7:8300: getsockopt: no route to host
2017/01/20 10:51:01 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.7.5:8300 100.96.7.5:8300}: dial tcp 100.96.7.5:8300: getsockopt: no route to host
2017/01/20 10:51:05 [WARN] raft: Election timeout reached, restarting election
2017/01/20 10:51:05 [INFO] raft: Node at 100.96.4.13:8300 [Candidate] entering Candidate state in term 4325
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.4.7:8300 100.96.4.7:8300}: dial tcp 100.96.4.7:8300: getsockopt: no route to host
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.10:8300 100.96.6.10:8300}: dial tcp 100.96.6.10:8300: getsockopt: no route to host
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.6.11:8300 100.96.6.11:8300}: dial tcp 100.96.6.11:8300: getsockopt: no route to host
2017/01/20 10:51:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 100.96.7.5:8300 100.96.7.5:8300}: dial tcp 100.96.7.5:8300: getsockopt: no route to host
2017/01/20 10:51:12 [INFO] agent.rpc: Accepted client: 127.0.0.1:42080
...
santinoncs commented 7 years ago

Hi

This happened to me during a GKE cluster version upgrade. I had a Consul cluster deployed and performed the available GKE upgrade from 1.5.2 to 1.5.3. Because the upgrade restarts the nodes one by one, two pods ended up on the same node, consensus was broken, and I got the same error:

HTTP error code from Consul: 500 Internal Server Error

santinoncs commented 7 years ago

To avoid downtime in the Consul cluster when performing a GKE version upgrade, I modified the StatefulSet with this:

  lifecycle:
    preStop:
      exec:
        command:
        - /bin/sh
        - -c
        - consul leave

With this, if a pod is evicted from a node, it will leave the cluster gracefully.

I also added a PodDisruptionBudget with minAvailable: 2, so a node drain will wait until that constraint is satisfied.
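
For reference, a minimal sketch of such a PodDisruptionBudget, assuming the server pods carry an app: consul label (adjust the selector to whatever labels your StatefulSet actually uses; on current Kubernetes the apiVersion is policy/v1):

  # Sketch only: keep at least 2 Consul servers running during voluntary disruptions (drains).
  # The app: consul label is an assumption; match it to your StatefulSet's pod labels.
  apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    name: consul-pdb
  spec:
    minAvailable: 2
    selector:
      matchLabels:
        app: consul

With three servers, minAvailable: 2 means only one server pod can be voluntarily evicted at a time, which preserves Raft quorum during rolling node upgrades.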

combatpoodle commented 6 years ago

Just ran a quick test on GKE off of PR #34, which is pretty close to mainline, just Consul 1.2 instead of 0.9.1. kill -9 on all the agents results in them getting brought back up on different hosts, but alive and synced just the same.
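
For anyone repeating that test, a quick way to confirm the cluster re-formed afterwards, assuming the default pod names from this repo (note that Consul 1.x uses the list-peers subcommand rather than the -list-peers flag shown earlier in this thread):

  # Check Serf membership and confirm a Raft leader was elected again.
  kubectl exec --tty -i consul-0 -- consul members
  kubectl exec --tty -i consul-0 -- consul operator raft list-peers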