jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

Jaeger is OOMKilled when using Badger as storage #2987

Closed: Sallyan closed this issue 3 years ago

Sallyan commented 3 years ago

Version: all-in-one:1.18.1

Issue: when I use Badger as storage, Jaeger uses much more memory than with in-memory storage and keeps getting OOMKilled. With in-memory storage, Jaeger initially uses only a few hundred MB of memory, but after switching to Badger it uses over 5Gi.

Jaeger YAML:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: tracing-jaeger
  namespace: kyma-system
spec:
  agent:
    config: {}
    options: {}
    resources: {}
  allInOne:
    config: {}
    image: eu.gcr.io/kyma-project/external/jaegertracing/all-in-one:1.18.1
    options:
      log-level: info
    resources: {}
  annotations:
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/rewriteAppHTTPProbers: "true"
  collector:
    config: {}
    options: {}
    resources: {}
  ingester:
    config: {}
    options: {}
    resources: {}
  ingress:
    enabled: false
    openshift: {}
    options: {}
    resources: {}
    security: none
  query:
    options: {}
    resources: {}
  resources:
    limits:
      cpu: 500m
      memory: 8Gi
    requests:
      cpu: 200m
      memory: 6Gi
  sampling:
    options: {}
  storage:
    cassandraCreateSchema: {}
    dependencies:
      resources: {}
      schedule: 55 23 * * *
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        limits:
          memory: 16Gi
        requests:
          cpu: "1"
          memory: 16Gi
      storage: {}
    esIndexCleaner:
      numberOfDays: 7
      resources: {}
      schedule: 55 23 * * *
    esRollover:
      resources: {}
      schedule: 0 0 * * *
    options:
      badger:
        directory-key: /badger/key
        directory-value: /badger/data
        ephemeral: false
        span-store-ttl: 24h
        truncate: true
      cassandra:
        keyspace: jaeger_v1_datacenter3
        servers: cassandra.default.svc
      es:
        server-urls: http://elasticsearch-client.default.svc:9200
      memory:
        max-traces: 10000
    type: badger
  strategy: allinone
  ui:
    options:
      dependencies:
        menuEnabled: true
      menu:
      - items:
        - label: Documentation
          url: https://www.jaegertracing.io/docs/latest
        label: About Jaeger
      - items:
        - label: Documentation
          url: https://kyma-project.io/docs/components/tracing/
        label: About Kyma
  volumeMounts:
  - mountPath: /badger
    name: data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: jaeger-pvc

Sallyan commented 3 years ago

[Screenshot: Screen Shot 2021-05-11 at 4 02 42 PM]

Memory usage increased from the MB range to the Gi range after switching from in-memory storage to Badger.

jpkrohling commented 3 years ago

I think this isn't related to the operator; it is a more general issue with Jaeger and its use of Badger.

jpkrohling commented 3 years ago

We have a few changes in the queue for badger as storage for Jaeger, but it might take a while for us to work on it. If this is critical to you, I can point you to the places in the code for you to take a look at.

Sallyan commented 3 years ago

We really want to have persistent data for the tracing. Do you have any suggestions or some kind of workaround for Badger? Yes, please also point me to the code. Thanks!

jpkrohling commented 3 years ago

> we really want to have persistent data for the tracing.

You can use Cassandra or Elasticsearch for that, which are actually recommended if you need to scale your deployment...

In any case, here's the badger code: https://github.com/jaegertracing/jaeger/tree/master/plugin/storage/badger
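
As a rough illustration of the Cassandra/Elasticsearch suggestion above, the custom resource from the issue could be pointed at Elasticsearch along the following lines. This is only a sketch: it reuses the `server-urls` value that already appears in the reporter's YAML, and the `production` strategy is an assumption (the original uses `allinone`). It also assumes an Elasticsearch cluster is actually reachable at that address.

```yaml
# Sketch only: values are taken from the YAML in the issue where possible.
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: tracing-jaeger
  namespace: kyma-system
spec:
  # Assumption: separate collector/query deployment instead of all-in-one.
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        # Same URL as in the reporter's storage options.
        server-urls: http://elasticsearch-client.default.svc:9200
```

With an external backend, trace data lives outside the Jaeger pod, so the Badger volume and volume mount from the original resource would no longer be needed.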

jpkrohling commented 3 years ago

This should have been fixed by #3096. If you are still experiencing this, feel free to reopen.

rmannibucau commented 1 year ago

I still see this behavior with the all-in-one image v1.49. The weird thing is that the OOM happens at startup with no load at all (probably when Badger is being reopened).

Edit: it seems the entries under /key are not evicted, whereas /data is almost empty:

12G /opt/jaeger-badger/key
16K /opt/jaeger-badger/data

Mystery solved: Spring Boot with HikariCP + Sleuth creates spans even for connection.isValid(), which fills Badger too quickly if the timeout and pool size are high enough. Sorry for the noise.
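
For anyone hitting the same cause, one possible way to stop these spans at the source is to turn off Sleuth's JDBC instrumentation in the application's `application.yml`. This is a hedged sketch, not something confirmed in this thread: it assumes Spring Cloud Sleuth 3.1 or later, where the `spring.sleuth.jdbc.enabled` property should be available, and it disables tracing for all JDBC calls, not only `connection.isValid()`.

```yaml
# Sketch only: assumes Spring Cloud Sleuth 3.1+ with JDBC instrumentation.
spring:
  sleuth:
    jdbc:
      # Assumption: disables all JDBC spans, including those created
      # for connection validation by the pool.
      enabled: false
```

If JDBC spans are still wanted for real queries, lowering the sampling rate or reducing the pool size and validation frequency are alternatives, at the cost of less complete traces.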