jaegertracing / jaeger-kubernetes

Support for deploying Jaeger into Kubernetes
https://jaegertracing.io/
Apache License 2.0

Elasticsearch in production always Back-off restarting failed container #84

Open chalvern opened 6 years ago

chalvern commented 6 years ago

elasticsearch version:

docker.elastic.co/elasticsearch/elasticsearch:5.6.0

k8s cluster version

1.10

describe

# kubectl describe pods -n jaeger  elasticsearch-0

Name:           elasticsearch-0
Namespace:      jaeger
Node:           node-1/192.168.205.128
Start Time:     Sat, 28 Apr 2018 16:44:35 +0800
Labels:         app=jaeger-elasticsearch
                controller-revision-hash=elasticsearch-8684f69799
                jaeger-infra=elasticsearch-replica
                statefulset.kubernetes.io/pod-name=elasticsearch-0
Annotations:    <none>
Status:         Running
IP:             192.168.3.197
Controlled By:  StatefulSet/elasticsearch
Containers:
  elasticsearch:
    Container ID:  docker://941824d0c9186862372c793d41d578a5e34c0972c877771d00629dc375593530
    Image:         docker.elastic.co/elasticsearch/elasticsearch:5.6.0
    Image ID:      docker-pullable://docker.elastic.co/elasticsearch/elasticsearch@sha256:f95e7d4256197a9bb866b166d9ad37963dc7c5764d6ae6400e551f4987a659d7
    Port:          <none>
    Host Port:     <none>
    Command:
      bin/elasticsearch
    Args:
      -Ehttp.host=0.0.0.0
      -Etransport.host=127.0.0.1
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Sat, 28 Apr 2018 16:50:57 +0800
      Finished:     Sat, 28 Apr 2018 16:50:57 +0800
    Ready:          False
    Restart Count:  6
    Readiness:      exec [curl --fail --silent --output /dev/null --user elastic:changeme localhost:9200] delay=5s timeout=4s period=5s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8l8qt (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  data:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-8l8qt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-8l8qt
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age               From                 Message
  ----     ------                 ----              ----                 -------
  Normal   Scheduled              7m                default-scheduler    Successfully assigned elasticsearch-0 to node-1
  Normal   SuccessfulMountVolume  7m                kubelet, node-1  MountVolume.SetUp succeeded for volume "data"
  Normal   SuccessfulMountVolume  7m                kubelet, node-1  MountVolume.SetUp succeeded for volume "default-token-8l8qt"
  Normal   Pulling                6m (x4 over 7m)   kubelet, node-1  pulling image "docker.elastic.co/elasticsearch/elasticsearch:5.6.0"
  Normal   Pulled                 6m (x4 over 7m)   kubelet, node-1  Successfully pulled image "docker.elastic.co/elasticsearch/elasticsearch:5.6.0"
  Normal   Created                6m (x4 over 7m)   kubelet, node-1  Created container
  Normal   Started                6m (x4 over 7m)   kubelet, node-1  Started container
  Warning  BackOff                2m (x22 over 7m)  kubelet, node-1  Back-off restarting failed container

log

# kubectl logs -n jaeger  elasticsearch-0
# nothing shown.
pavolloffay commented 6 years ago

@chalvern hi, did you manage to solve it? As there are no logs, it's hard to find out what caused the issue.

chalvern commented 6 years ago

@pavolloffay I am afraid not, but it is possibly a resource limit, as my k8s cluster is set up on 2 VMs, each with 2 CPUs / 2 GB of memory. I will check it when I have some free time.

chalvern commented 6 years ago

As I said, it was out of memory:

May  3 21:27:03 xxx-1 kernel: [74354.386802] Out of memory: Kill process 35184 (java) score 1621 or sacrifice child
May  3 21:27:03 xxx-1 kernel: [74354.387300] Killed process 35184 (java) total-vm:2599788kB, anon-rss:1262648kB, file-rss:0kB
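The kernel log matches the `Exit Code: 137` in the `kubectl describe` output above: exit codes above 128 encode the fatal signal (code = 128 + signal number), and 137 − 128 = 9 is SIGKILL, which is what the OOM killer sends. A quick shell check:

```shell
# Container exit codes above 128 mean the process died from a signal:
# exit code = 128 + signal number.
echo $((137 - 128))   # 9
kill -l 9             # KILL, i.e. SIGKILL -- the OOM killer's signal
```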
pavolloffay commented 6 years ago

Then it's an environment issue; I will close it. If anything pops up, feel free to reopen.

chalvern commented 6 years ago

In the end, my solution was to add the following env config to elasticsearch.yml:

env:
  - name: ES_JAVA_OPTS
    value: "-Xms256m -Xmx512m"
  - name: bootstrap.memory_lock
    value: "true"
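When capping the JVM heap like this, it is usually paired with Kubernetes resource requests/limits so the container's memory ceiling sits comfortably above the heap (Elasticsearch also needs off-heap memory; the common guidance is to keep `-Xmx` at no more than about half the available memory). A sketch of such a container spec fragment — the resource numbers below are illustrative, not taken from this repo's manifests:

```yaml
# Hypothetical fragment: heap capped at 512m, container limit left well
# above the heap to leave room for off-heap memory.
containers:
  - name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:5.6.0
    env:
      - name: ES_JAVA_OPTS
        value: "-Xms256m -Xmx512m"
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "1.5Gi"
```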
jpkrohling commented 6 years ago

I'm reopening this, so that we apply @chalvern's env vars to elasticsearch.yml.

jpkrohling commented 6 years ago

@chalvern would you be interested in contributing a fix to this?

pavolloffay commented 6 years ago

-Xms256m -Xmx512m seems very low for Elasticsearch. For example, OpenShift logging uses 8 GB by default.

pavolloffay commented 6 years ago

I am also adding a pointer to the docs for bootstrap.memory_lock: https://www.elastic.co/guide/en/elasticsearch/reference/master/setup-configuration-memory.html#bootstrap-memory_lock

chalvern commented 6 years ago

@jpkrohling I worry that -Xms256m -Xmx512m is too low for production, just as @pavolloffay mentioned. The "production" Elasticsearch YAML actually looks more like a test setup than a production one.

What I suggest is to treat it as a test configuration. In production, Elasticsearch should run with replicas, i.e. as a cluster.
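As a sketch of that suggestion, a production-leaning setup would raise the StatefulSet replica count and let the nodes discover each other instead of binding transport to 127.0.0.1. The service name and discovery settings below are illustrative assumptions for the 5.x image, not taken from this repo's manifests:

```yaml
# Hypothetical StatefulSet fragment for a small 3-node cluster.
spec:
  serviceName: elasticsearch      # assumed headless service name
  replicas: 3
  template:
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:5.6.0
          env:
            - name: discovery.zen.ping.unicast.hosts
              value: "elasticsearch"   # resolves to the peer pods
            - name: discovery.zen.minimum_master_nodes
              value: "2"               # quorum of 3 master-eligible nodes
```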