Does kubectl describe po/es-data-0
show any kind of timeouts / issues during startup observed by Kubernetes?
Eventually the pod gets killed because it does not start up fast enough to answer the livenessProbe.
Additionally, it's not a good idea to allocate almost all of the available memory to the Java heap:
[...]
- name: ES_JAVA_OPTS
  value: -Xms3g -Xmx3g
[...]

vs.

[...]
resources:
  requests:
    memory: 3Gi
  limits:
    cpu: 2
    memory: 4Gi
[...]
Per the Elastic documentation:

Set Xmx to no more than 50% of your physical RAM, to ensure that there is enough physical RAM left for kernel file system caches.
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
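For example, with the limits above, setting the heap to roughly half of the memory limit would look something like this (2g against the 4Gi limit; values are illustrative, adjust to your actual limit):

[...]
- name: ES_JAVA_OPTS
  value: -Xms2g -Xmx2g   # heap at ~50% of the 4Gi memory limit
[...]
resources:
  requests:
    memory: 3Gi
  limits:
    cpu: 2
    memory: 4Gi
[...]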
I've changed ES_JAVA_OPTS to be 50% of the limits (and tried different variations of those parameters), but I still get the same crash loop.
Also, I don't see anything strange in the events log besides the readiness probe failure when trying to reach the _cluster/health
endpoint, but that is only because the init process has not come to the point of starting up the service successfully.
kubectl describe pod es-data-0
Name: es-data-0
Namespace: default
Node: kubernetes-node1/<ip-addr>
Start Time: Mon, 09 Jul 2018 10:30:02 +0200
Labels: component=elasticsearch
controller-revision-hash=es-data-776697d896
role=data
statefulset.kubernetes.io/pod-name=es-data-0
Annotations: <none>
Status: Running
IP: 10.44.0.0
Controlled By: StatefulSet/es-data
Init Containers:
init-sysctl:
Container ID: docker://320600efc4f4e2450933de60300b04b62fc442b422f55db0636c42ace9750115
Image: busybox:1.27.2
Image ID: docker-pullable://busybox@sha256:bbc3a03235220b170ba48a157dd097dd1379299370e1ed99ce976df0355d24f0
Port: <none>
Host Port: <none>
Command:
sysctl
-w
vm.max_map_count=262144
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 09 Jul 2018 10:30:03 +0200
Finished: Mon, 09 Jul 2018 10:30:03 +0200
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-875d5 (ro)
Containers:
es-data:
Container ID: docker://0ab42eec948bfb61180ff55d9c431e9cbc7afb719fe47a2b76c716e6f0a726cc
Image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.0
Image ID: docker-pullable://quay.io/pires/docker-elasticsearch-kubernetes@sha256:dcd3e9db3d2c6b9a448d135aebcacac30a4cca655d42efaa115aa57405cd22f3
Ports: 9200/TCP, 9300/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Mon, 09 Jul 2018 10:30:49 +0200
Last State: Terminated
Reason: Error
Exit Code: 143
Started: Mon, 09 Jul 2018 10:30:05 +0200
Finished: Mon, 09 Jul 2018 10:30:49 +0200
Ready: False
Restart Count: 1
Limits:
cpu: 2
memory: 4Gi
Requests:
cpu: 2
memory: 2Gi
Liveness: tcp-socket :transport delay=20s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:http/_cluster/health delay=20s timeout=5s period=10s #success=1 #failure=3
Environment:
NAMESPACE: default (v1:metadata.namespace)
NODE_NAME: es-data-0 (v1:metadata.name)
CLUSTER_NAME: myesdb
NODE_MASTER: false
NODE_INGEST: false
HTTP_ENABLE: true
ES_JAVA_OPTS: -Xms2g -Xmx2g
PROCESSORS: 2 (limits.cpu)
Mounts:
/data from storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-875d5 (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: storage-es-data-0
ReadOnly: false
default-token-875d5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-875d5
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned es-data-0 to kubernetes-node1
Normal SuccessfulMountVolume 1m kubelet, kubernetes-node1 MountVolume.SetUp succeeded for volume "default-token-875d5"
Normal SuccessfulMountVolume 1m kubelet, kubernetes-node1 MountVolume.SetUp succeeded for volume "pvc-306d82b3-805e-11e8-ac5c-7625bd182864"
Normal Pulled 59s kubelet, kubernetes-node1 Container image "busybox:1.27.2" already present on machine
Normal Created 59s kubelet, kubernetes-node1 Created container
Normal Started 59s kubelet, kubernetes-node1 Started container
Warning Unhealthy 19s (x2 over 29s) kubelet, kubernetes-node1 Readiness probe failed: Get http://10.44.0.0:9200/_cluster/health: dial tcp 10.44.0.0:9200: getsockopt: connection refused
Warning Unhealthy 14s (x3 over 34s) kubelet, kubernetes-node1 Liveness probe failed: dial tcp 10.44.0.0:9300: getsockopt: connection refused
Normal Pulled 13s (x2 over 57s) kubelet, kubernetes-node1 Container image "quay.io/pires/docker-elasticsearch-kubernetes:6.3.0" already present on machine
Normal Created 13s (x2 over 57s) kubelet, kubernetes-node1 Created container
Normal Started 13s (x2 over 57s) kubelet, kubernetes-node1 Started container
Normal Killing 13s kubelet, kubernetes-node1 Killing container with id docker://es-data:Container failed liveness probe.. Container will be killed and recreated.
But you can see that, as I already assumed, the liveness
probe is failing and is therefore killing the container :-/
Warning Unhealthy 14s (x3 over 34s) kubelet, kubernetes-node1 Liveness probe failed: dial tcp 10.44.0.0:9300: getsockopt: connection refused
Normal Killing 13s kubelet, kubernetes-node1 Killing container with id docker://es-data:Container failed liveness probe.. Container will be killed and recreated.
Could you try to increase the initialDelaySeconds
of the livenessProbe to something like 2 or 5 minutes to see if the pod then comes up? Afterwards you can reduce the delay to a value closer to the actual required start time.
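For reference, a rough sketch of what that could look like on the es-data container (probe type and port name match the describe output above; the 300-second delay is just a generous starting value to rule out slow startup, not a recommendation):

livenessProbe:
  tcpSocket:
    port: transport        # same transport port the current probe checks
  initialDelaySeconds: 300  # give Elasticsearch plenty of time to start before the first check
  periodSeconds: 10

The readinessProbe against /_cluster/health can be given a longer initialDelaySeconds in the same way if it keeps failing during startup.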
Hey, that worked. :) That was a trivial oversight on my side. But as the cluster takes on data to process during service startup, the readiness/liveness probes will probably need to be adjusted further.
Thank you @mat1010 ! I will close the issue now.
I'm so glad other people have had these issues before me.
@ivanovaleksandar it appears your Elasticsearch data node is running on GlusterFS persistent storage. Did you run into issues related to CorruptIndexException errors?
@kcao3 Yes, I did run into that particular issue, and I switched to Rook (a Ceph-based block storage solution).
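In case it helps anyone making the same switch, the change was mostly a matter of pointing the StatefulSet's volumeClaimTemplates at the Rook-provisioned StorageClass. A rough sketch (the class name rook-ceph-block and the 20Gi size are placeholders, not from my actual manifest):

volumeClaimTemplates:
- metadata:
    name: storage                       # matches the /data mount and the storage-es-data-0 claim above
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: rook-ceph-block   # placeholder; use the StorageClass created by your Rook install
    resources:
      requests:
        storage: 20Gi                   # illustrative size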
The data node is constantly restarting in the initialization phase after working well for a few days.
After that it crashes and starts a new init once again.
This is the YAML that I am using. I've set requests and limits accordingly (the Java heap is smaller than the pod limits, as people suggested), but to no avail.
Any ideas or suggestions?