ajohnstone opened this issue 7 years ago
Hi
This happened to me during a GKE cluster upgrade. I had a Consul cluster deployed and performed the available GKE upgrade from 1.5.2 to 1.5.3. Because the upgrade restarts the nodes one by one, two Consul pods ended up on the same node. What happened? Consensus was broken and I got the same error:
HTTP error code from Consul: 500 Internal Server Error
To avoid downtime in the Consul cluster when performing a version upgrade in GKE, I modified the StatefulSet with this:
lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - consul leave
With this, if a pod is evicted from a node, it will leave the cluster gracefully.
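For reference, here is a minimal sketch of where that hook goes in the StatefulSet spec; the names, labels, and image are placeholders rather than values from this repo's manifests:

```yaml
# Minimal sketch only; names, labels, and image are placeholders.
apiVersion: apps/v1beta1   # apps/v1 on newer clusters
kind: StatefulSet
metadata:
  name: consul
spec:
  serviceName: consul
  replicas: 3
  template:
    metadata:
      labels:
        app: consul
    spec:
      containers:
      - name: consul
        image: consul   # pin the version you actually run
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "consul leave"]
```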
I also added a PodDisruptionBudget with minAvailable: 2, so the drain will wait until that is satisfied.
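A sketch of that PodDisruptionBudget, assuming the server pods carry an app: consul label (adjust the name and selector to match your chart):

```yaml
# Keeps at least 2 Consul servers available during voluntary disruptions
# such as the node drains performed by a GKE upgrade.
apiVersion: policy/v1beta1   # policy/v1 on current clusters
kind: PodDisruptionBudget
metadata:
  name: consul-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: consul
```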
Just ran a quick test on GKE off of PR #34, which is pretty close to mainline, just Consul 1.2 instead of 0.9.1. kill -9 on all the agents results in them getting brought back up (on different hosts), but alive and synced just the same.
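Roughly what that test looks like as commands (a sketch; the pod names and the availability of pkill inside the container image are assumptions):

```sh
# Force-kill the Consul agent in each pod; the kubelet restarts the containers.
for i in 0 1 2; do kubectl exec consul-$i -- pkill -9 consul; done

# Once the pods are back, confirm the servers rejoined and are in sync.
kubectl exec consul-0 -- consul members
```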
Testing node failures does not restore the cluster:
$ kubectl delete pods consul-2 consul-1;
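After deleting two of the three servers, one way to confirm the loss of quorum from the surviving pod (a sketch; assumes consul-0 is still running and that the image ships busybox wget):

```sh
# With 2 of 3 servers gone there is no Raft quorum, so the peer listing
# typically errors and the leader endpoint returns an empty address.
kubectl exec consul-0 -- consul operator raft list-peers
kubectl exec consul-0 -- wget -qO- http://127.0.0.1:8500/v1/status/leader
```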