nabadger opened this issue 6 years ago (status: Open)
There's actually some obvious information in our logs that helps:
[o.e.d.z.PublishClusterStateAction] [es-master-d4d46765-vh5tp] timed out waiting for all nodes to process published state [35] (timeout [30s], pending nodes: [{es-data-b479bcbd-brt64}{d2QLc-r1Qjy-XDuVSkXg1Q}{a0OoXCtsRryhwlq0wmyJDg}{10.244.2.4}{10.244.2.4:9300}{xpack.installed=true}])
Note the 30s timeout.
We think it's related to this: https://discuss.elastic.co/t/timed-out-waiting-for-all-nodes-to-process-published-state-and-cluster-unavailability/138590
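If I'm reading the Zen discovery docs right, that 30s matches the default of `discovery.zen.publish_timeout`, i.e. how long the elected master waits for every node to acknowledge a published cluster state. Purely as an illustration (not the fix we went with), it's a regular node setting, so it could be adjusted at startup:

```sh
# Illustrative only: change the cluster-state publish timeout from its 30s default.
# The setting can live in elasticsearch.yml or be passed on the command line.
bin/elasticsearch -Ediscovery.zen.publish_timeout=10s
```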
Adding an extra sleep after trapping the SIGTERM seems to resolve the issue for us (see the merge above if you're interested).
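For anyone curious about the shape of the workaround without digging through the merge, it's roughly this kind of entrypoint wrapper (a minimal sketch with illustrative paths and sleep duration, not the exact change):

```sh
#!/bin/sh
# Sketch only: run Elasticsearch in the background and forward signals to it.
/usr/share/elasticsearch/bin/elasticsearch "$@" &
es_pid=$!

handle_term() {
  sleep 5                 # the "extra sleep" after trapping SIGTERM
  kill -TERM "$es_pid"    # then let Elasticsearch shut down cleanly
}
trap handle_term TERM INT

# wait returns early when a signal interrupts it, so wait again for the real exit
wait "$es_pid"
trap - TERM INT
wait "$es_pid"
```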
Hi,
I've been struggling to understand what's causing this, so I wonder if you can offer any help. This is something I can reproduce across various kubernetes-elasticsearch repos (including the operators as well). It's also something I can reproduce on various clusters.
I'd really like to know if this is expected behaviour or not...
My Configuration:
I've set up ES using the example in the README (this is a 3-node Kubernetes cluster running `v1.11.3`). This all works fine and brings up the ES cluster as expected.
I monitor the state of the ES master by execing into an ingestion pod (`kubectl exec ...`) and running a cURL command against the cluster.

I then `kubectl exec` into the pod running the ES master and run `kill 1` (the Java process). This starts the master re-election process straight away, and typically a new master is elected in 2-3 seconds (expected, right?).
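For reference, the commands involved look roughly like this (pod names are placeholders, and the cURL check shown is just a representative one since the exact command I use isn't pasted above):

```sh
# Placeholder check: which node is currently the elected master?
kubectl exec -it <ingestion-pod> -- curl -s localhost:9200/_cat/master

# Kill the Java process (PID 1) inside the pod currently holding the master role
kubectl exec -it <master-pod> -- kill 1
```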
If, on the other hand, I delete the pod which is running the master (`kubectl delete pod <master pod>`), re-election always takes over 30 seconds. At this point the cURL command also hangs until the new master is available. I don't think this is expected, as it essentially means the cluster is unavailable to use.

I've also tried playing with various Kubernetes pod-termination timeouts, along with the ES fault-detection timeouts, but can't seem to work around the problem.
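To be concrete about the knobs I've tried (values and resource names are illustrative, adjust for your own manifests):

```sh
# 1. Longer pod termination grace period on the master Deployment
kubectl patch deployment es-master --type merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":60}}}}'

# 2. More aggressive Zen fault detection (node settings, e.g. in elasticsearch.yml
#    or passed with -E at startup); the defaults are 1s / 30s / 3
#      discovery.zen.fd.ping_interval: 1s
#      discovery.zen.fd.ping_timeout: 5s
#      discovery.zen.fd.ping_retries: 3
```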
Do you know if this is expected behaviour? If so, how do people actually upgrade the masters with only a short period of downtime? We also run ES outside of Kubernetes, and master re-election there happens in under 3s (because we're essentially just sending SIGTERM to the parent process, like `kill 1`), hence I feel this is a Kubernetes thing.

I've added 2 sets of logs:
1 - Logs with `kill 1` on the ES Java process
2 - Logs with `kubectl delete pod` on the pod hosting the master ES instance

In the set of logs where we `kubectl delete pod`, it looks like master re-election happens twice.