open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Jaeger transform to OTLP causes high memory usage, OOM #11457

chenhaipeng closed this issue 2 years ago

chenhaipeng commented 2 years ago

Describe the bug: We use the Jaeger agent to collect spans and send them to the OpenTelemetry Collector through port 14250. After observing for a while, we found that the OpenTelemetry Collector runs out of memory (OOM).

Steps to reproduce: Use the Jaeger agent to collect spans and send them to the OpenTelemetry Collector through port 14250.

What did you expect to see? A clear and concise description of what you expected to see.

What did you see instead? I dumped the heap with pprof; here is the top output:

      flat  flat%   sum%        cum   cum%
 1330.38MB 23.06% 23.06%  1330.38MB 23.06%  go.opentelemetry.io/collector/pdata/internal.SpanSlice.AppendEmpty
 1190.78MB 20.64% 43.71%  1190.78MB 20.64%  go.opentelemetry.io/collector/pdata/internal.Map.EnsureCapacity
  801.48MB 13.89% 57.60%   801.48MB 13.89%  go.opentelemetry.io/collector/pdata/internal/data/protogen/trace/v1.(*TracesData).Marshal
  620.01MB 10.75% 68.35%   620.01MB 10.75%  github.com/jaegertracing/jaeger/model.(*KeyValue).Unmarshal
  480.81MB  8.34% 76.68%   480.81MB  8.34%  google.golang.org/grpc/internal/transport.newBufWriter
     259MB  4.49% 81.17%      259MB  4.49%  go.opentelemetry.io/collector/pdata/internal.Value.SetStringVal
  237.75MB  4.12% 85.30%   237.75MB  4.12%  bufio.NewReaderSize
  236.61MB  4.10% 89.40%   853.13MB 14.79%  github.com/jaegertracing/jaeger/model.(*Span).Unmarshal
  169.66MB  2.94% 92.34%   169.66MB  2.94%  bytes.makeSlice
  106.50MB  1.85% 94.19%   106.50MB  1.85%  go.opentelemetry.io/collector/pdata/internal.Value.SetIntVal

  (pprof) list .AppendEmpty
Total: 5.63GB
ROUTINE ======================== go.opentelemetry.io/collector/pdata/internal.ResourceSpansSlice.AppendEmpty in go.opentelemetry.io/collector/pdata@v0.53.0/internal/generated_ptrace.go
   10.50MB    10.50MB (flat, cum)  0.18% of Total
 Error: could not find file go.opentelemetry.io/collector/pdata@v0.53.0/internal/generated_ptrace.go on path /data/home/severnchen
ROUTINE ======================== go.opentelemetry.io/collector/pdata/internal.ScopeSpansSlice.AppendEmpty in go.opentelemetry.io/collector/pdata@v0.53.0/internal/generated_ptrace.go
       8MB        8MB (flat, cum)  0.14% of Total
 Error: could not find file go.opentelemetry.io/collector/pdata@v0.53.0/internal/generated_ptrace.go on path /data/home/severnchen
ROUTINE ======================== go.opentelemetry.io/collector/pdata/internal.SpanSlice.AppendEmpty in go.opentelemetry.io/collector/pdata@v0.53.0/internal/generated_ptrace.go
    1.30GB     1.30GB (flat, cum) 23.06% of Total
 Error: could not find file go.opentelemetry.io/collector/pdata@v0.53.0/internal/generated_ptrace.go on path /data/home/severnchen

What version did you use? Version: v0.53.0

What config did you use? Config:


  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
            cors:
              allowed_origins:
                - "*"
              allowed_headers:
                - "*"
              max_age: 7200

      jaeger:
        protocols:
          grpc:
          thrift_binary:
          thrift_compact:
          thrift_http:

      zipkin:

    exporters:
      logging:
        loglevel: error

      kafka/traces:
        brokers:
          - localhost:9092
        topic: test_traces
        producer:
          # 10 MB
          max_message_bytes: 10000000
        protocol_version: 2.0.0
        metadata:
          retry:
            max: 0
        timeout: 5s
        retry_on_failure:
          enabled: false
        sending_queue:
          enabled: true
          num_consumers: 50
          queue_size: 100000

      kafka/logs:
        brokers:
          - localhost:9092
        topic: test_log
        producer:
          # 10 MB
          max_message_bytes: 10000000
        protocol_version: 2.0.0
        metadata:
          retry:
            max: 0
        timeout: 5s
        retry_on_failure:
          enabled: false
        sending_queue:
          enabled: true
          num_consumers: 50
          queue_size: 100000

    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 80
        spike_limit_percentage: 30
      batch:
        send_batch_size: 128
        timeout: 100ms

    extensions:
      health_check:
      pprof:
        endpoint: :1888
      zpages:
        endpoint: :55679
      memory_ballast:
        size_in_percentage: 20

    service:
      extensions: [ pprof, zpages, health_check ]
      pipelines:
        traces:
          receivers: [ otlp, jaeger, zipkin ]
          processors: [ batch ]
          exporters: [ kafka/traces ]
        logs:
          receivers: [ otlp ]
          processors: [ batch ]
          exporters: [ kafka/logs ]
        metrics:
          receivers: [ otlp ]
          processors: [ batch ]
          exporters: [ logging ]
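
Note that the memory_limiter processor and the memory_ballast extension are defined in the config above but are not referenced under service, so neither is active. A minimal sketch of the service section with both wired in (values unchanged from above; the memory_limiter is conventionally the first processor in each pipeline, and the logs and metrics pipelines would be adjusted analogously):

service:
  extensions: [ pprof, zpages, health_check, memory_ballast ]
  pipelines:
    traces:
      receivers: [ otlp, jaeger, zipkin ]
      processors: [ memory_limiter, batch ]
      exporters: [ kafka/traces ]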

Environment OS: (e.g., "Ubuntu 20.04") Compiler (if manually compiled): (e.g., "go 14.2")

Additional context Add any other context about the problem here.

dmitryax commented 2 years ago

cc @jpkrohling as code owner

dmitryax commented 2 years ago

I may be able to take a look later

jpkrohling commented 2 years ago

@dmitryax, did you have a chance to look at it? Should I add this to my queue?

dmitryax commented 2 years ago

@jpkrohling I don't have enough cycles to look into this this month. Feel free to take it if you can pick it up sooner

jpkrohling commented 2 years ago

I'll add this to my queue, which is already quite lengthy, so I might not be able to work on it right away.

jpkrohling commented 2 years ago

I haven't been able to reproduce this yet. If the problem were with the Jaeger receiver, it should have been evident with this configuration:

receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250

exporters:
  logging:

processors:

extensions:
  pprof:

service:
  extensions: [pprof]
  pipelines:
    traces:
      receivers: [jaeger]
      processors:
      exporters: [logging]

I ran Jaeger's tracegen for 5 minutes with a 10ms pause, and then for 5 more minutes with a 1ms pause, and nothing suspicious happened:

[screenshots attached]

If you have more information on how to reproduce this issue, let me know and I'll continue looking into this. Otherwise, I'll close this in a few days, as I don't have enough information to continue investigating.
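
For a reproduction attempt closer to the reporter's pipeline, a sketch that also includes the batch processor and the kafka exporter (assuming a local Kafka broker at localhost:9092, with the exporter and batch settings taken from the original config) would be:

receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250

processors:
  batch:
    send_batch_size: 128
    timeout: 100ms

exporters:
  kafka/traces:
    brokers:
      - localhost:9092
    topic: test_traces
    protocol_version: 2.0.0

service:
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [batch]
      exporters: [kafka/traces]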

jpkrohling commented 2 years ago

I'm closing this as I don't have enough information to keep working on this.