open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

OOM caused by servicegraph connector #30634

Open wjh0914 opened 8 months ago

wjh0914 commented 8 months ago

Component(s)

connector/servicegraph

What happened?

Description

We are trying to use the servicegraph connector to generate a service topology and are running into an OOM issue.

When the OTel Collector starts, the memory usage keeps growing:

(screenshot: memory usage over time)

The profile shows that pmap takes a lot of memory for the servicegraph connector:

(screenshots: heap profile and pprof top output)

Collector version

0.89.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

    receivers:
      otlp:
        protocols:
          grpc:
          http:
      jaeger:
        protocols:
          grpc:
          thrift_binary:
          thrift_compact:
          thrift_http:
      otlp/spanmetrics:
        protocols:
          grpc:
            endpoint: "0.0.0.0:12345"
      otlp/servicegraph:
        protocols:
          grpc:
            endpoint: "0.0.0.0:23456"
    exporters:
      logging:
        loglevel: info
      prometheus:
        endpoint: "0.0.0.0:8869"
        metric_expiration: 8760h
      prometheus/servicegraph:
        endpoint: "0.0.0.0:9090"
        metric_expiration: 8760h
      prometheusremotewrite:
        endpoint: 'http://vminsert-sample-vmcluster.svc.cluster.local:8480/insert/0/prometheus/api/v1/write'
        remote_write_queue:
          queue_size: 10000
          num_consumers: 5
        target_info:
          enabled: false
      otlp:
        endpoint: ats-sample-jaeger-collector.ranoss:4317
        tls:
          insecure: true
        sending_queue:
          enabled: true
          num_consumers: 20
          queue_size: 10000
    processors:
      transform:
        trace_statements:
          - context: resource
            statements:
            - replace_match(attributes["namespace"], "","unknownnamespace")
            - replace_match(attributes["apptenantname"], "","unknownapptenant")
            - replace_match(attributes["appname"], "","unknownapp")
            - replace_match(attributes["componentname"], "","unknowncomponent")
            - replace_match(attributes["podname"], "","unknownpod")
            - limit(attributes, 100, [])
            - truncate_all(attributes, 4096)
      resource:
        attributes:
          - key: apptenantname
            action: insert
            value: unknownapptenant
          - key: apptenantname
            action: update
            from_attribute: namespace
          - key: namespace
            action: insert
            value: unknownnamespace
          - key: componentname
            action: insert
            value: unknowncomponent
          - key: appname
            action: insert
            value: unknownapplication
          - key: podname
            action: insert
            value: unknownpod
      batch:
        send_batch_size: 200
        send_batch_max_size: 200
      filter/spans:
        traces:
          span:
            - 'kind != 2'
      filter/servicegraph:
        traces:
          span:
            - 'kind != 2 and kind != 3'
      spanmetrics:
        metrics_exporter: prometheus
        latency_histogram_buckets: [10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms,  2s, 3s, 5s, 8s, 10s,12s,20s,32s,1m, 2m, 5m,10m, 30m]
        dimensions:
          - name: namespace
          - name: http.method
          - name: http.status_code
          - name: appname
          - name: componentname
          - name: podname
        dimensions_cache_size: 20000000
        metrics_flush_interval: 29s
    extensions:
      pprof:
        endpoint: '0.0.0.0:1777'
    connectors: 
      servicegraph:
        latency_histogram_buckets: [10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms,  2s, 3s, 5s, 8s, 10s,12s,20s,32s,1m, 2m, 5m,10m, 30m]
        dimensions: [namespace,appname,componentname,podname]
        store:
          ttl: 120s
          max_items: 200000
        metrics_flush_interval: 59s
    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger]
          processors: [resource, transform, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          exporters: [prometheusremotewrite]
        metrics/spanmetrics:
          receivers: [otlp/spanmetrics]
          exporters: [prometheus]
        traces/spanmetrics:
          receivers: [otlp, jaeger]
          processors: [filter/spans,spanmetrics]
          exporters: [logging]
        metrics/servicegraph:
          receivers: [servicegraph]
          exporters: [prometheus/servicegraph]
        traces/servicegraph:
          receivers: [otlp, jaeger]
          processors: [filter/servicegraph]
          exporters: [servicegraph]
      extensions: [pprof]

Log output

No response

Additional context

No response

github-actions[bot] commented 8 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Frapschen commented 8 months ago

Related: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/29762

luistilingue commented 6 months ago

I'm hitting the same issue with the servicegraph connector :(

github-actions[bot] commented 4 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

t00mas commented 4 months ago

Assign to me please, I'll have a look at this.

t00mas commented 4 months ago

Without deep diving into your detailed use case, there are a couple of things you can try:

rlankfo commented 3 months ago

@wjh0914 does this continue happening if you remove podname from dimensions?
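
For reference, the suggested change would look roughly like this (a minimal sketch; everything else in the servicegraph connector config posted above stays the same):

connectors:
  servicegraph:
    # podname dropped from dimensions to reduce per-pod series cardinality
    dimensions: [namespace, appname, componentname]

Since each distinct combination of dimension values becomes its own metric series, high pod churn with podname as a dimension multiplies the number of series the connector has to keep in memory.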

Frapschen commented 3 months ago

You can check these metrics:

rate(otelcol_connector_servicegraph_total_edges[1m])
rate(otelcol_connector_servicegraph_expired_edges[1m])

They can help you understand how many edges are in the store.

t00mas commented 3 months ago

I've been testing this, with some mixed results.

With the same config, and also with a pick-and-choose subset of it, there is a slow memory creep over time, so I was able to reproduce the problem in a limited way.

What's more interesting is that I think it's due to the GC not running as early as it could, or waiting too long between runs. I was able to make memory consumption stable using the GOGC and GOMEMLIMIT env vars, so I advise anyone affected to try that too.

This is probably also a case where giving the collector instance more memory is counterproductive: with the default GOGC of 100, the runtime may let the heap fill up before triggering GC runs.

tl;dr: I didn't find a clear memory leak, but setting GOGC well below 100 together with GOMEMLIMIT as a soft limit can trigger earlier GC runs and keep memory usage stable.
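
For anyone who wants to try the env var approach, here is a minimal sketch of what it could look like for a collector running as a Kubernetes container (the container spec and the specific values are illustrative assumptions, not recommendations from this thread):

# Fragment of a Deployment pod spec (illustrative only)
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib:0.89.0
    env:
      - name: GOGC        # default is 100; lower values make the GC run sooner
        value: "50"
      - name: GOMEMLIMIT  # soft memory limit for the Go runtime
        value: "1500MiB"
    resources:
      limits:
        memory: 2Gi

Keeping GOMEMLIMIT comfortably below the container memory limit gives the Go runtime room to run the GC before the kernel OOM killer steps in.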

github-actions[bot] commented 1 month ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.