open-telemetry / opentelemetry-collector

OpenTelemetry Collector

OTel Collector potential memory leak #802

Closed · mxiamxia closed this issue 4 years ago

mxiamxia commented 4 years ago

I have the collector set up with the following pipeline config and I am sending about 500 span requests per second to the Jaeger receiver in the collector. The collector heap size starts growing after it has run for 10 to 20 minutes, and the collector instance eventually crashes. From the profiling, I can see the span data buffered in jaeger.jSpansToOCProtoSpans(jspans []*jaeger.Span). I also tried the Zipkin receiver and hit the same problem, with the data buffered in zipkinreceiver.zipkinTagsToTraceAttributes.

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [opencensus, jaeger, zipkin]
      exporters: [jaeger, zipkin]
      processors: [batch, queued_retry]
    metrics:
      receivers: [opencensus]
      exporters: [prometheus]
[Screenshot attached: Screen Shot 2020-04-07 at 11 40 37 AM]
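For reference, a span load of roughly this shape can be generated with a small Jaeger client program along these lines. This is only a sketch: the service name, the localhost:6831 agent endpoint, and the tag contents are assumptions, not the actual test client used in this report.

package main

import (
	"log"
	"time"

	jaegercfg "github.com/uber/jaeger-client-go/config"
)

func main() {
	// Assumed target: the collector's Jaeger receiver listening for
	// thrift-compact UDP on localhost:6831 (adjust to your own setup).
	cfg := jaegercfg.Configuration{
		ServiceName: "load-test",
		Sampler:     &jaegercfg.SamplerConfig{Type: "const", Param: 1},
		Reporter:    &jaegercfg.ReporterConfig{LocalAgentHostPort: "localhost:6831"},
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatal(err)
	}
	defer closer.Close()

	// Roughly 500 spans per second, similar to the load described above.
	ticker := time.NewTicker(2 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		span := tracer.StartSpan("synthetic-span")
		span.SetTag("note", "padding to make the span a few hundred bytes")
		span.Finish()
	}
}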
tigrannajaryan commented 4 years ago

Can you also attach Collector logs?

Do you have both Jaeger and Zipkin backends up and running and accepting data from the Collector?

mxiamxia commented 4 years ago

Hi Tigran, I have both Jaeger and Zipkin backends running in separate containers (see attached picture). I am seeing the following errors in the logs. I guess the root cause could be that when the backend (the Zipkin server) fails to handle the POST request, the Collector fails to export the data, keeps buffering it on the collector side, and eventually hits OOM?

otel-agent_1 | 2020-04-06T22:38:48.432713200Z   {"exporter": "logging"}
otel-agent_1 | 2020-04-06T22:38:48.432735000Z {"level":"warn","ts":1586212728.2547228,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.19.0.4:14250: connect: connection refused\"","spanFormat":"zipkin"}
otel-agent_1 | 2020-04-06T22:38:48.432767200Z {"level":"warn","ts":1586212728.2583811,"caller":"queuedprocessor/queued_processor.go:199","msg":"Failed to process batch, re-enqueued","processor":"queued_retry","batch-size":5}
otel-agent_1 | 2020-04-06T22:34:48.559982900Z {"level":"warn","ts":1586212488.5281117,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"Post \"http://zipkin-all-in-one:9411/api/v2/spans\": dial tcp 172.19.0.2:9411: connect: connection refused","spanFormat":"zipkin"}
otel-agent_1 | 2020-04-06T22:34:48.560011900Z {"level":"warn","ts":1586212488.5281756,"caller":"queuedprocessor/queued_processor.go:199","msg":"Failed to process batch, re-enqueued","processor":"queued_retry","batch-size":16}
otel-agent_1 | 2020-04-06T22:34:48.560034700Z {"level":"warn","ts":1586212488.5282025,"caller":"queuedprocessor/queued_processor.go:205","msg":"Backing off before next attempt","processor":"queued_retry","backoff_delay":5}
otel-agent_1 | 2020-04-06T22:34:48.560155800Z {"level":"warn","ts":1586212488.535339,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"Post \"http://zipkin-all-in-one:9411/api/v2/spans\": dial tcp 172.19.0.2:9411: connect: connection refused","spanFormat":"zipkin"}
otel-agent_1 | 2020-04-06T22:34:48.560244700Z {"level":"warn","ts":1586212488.5357203,"caller":"queuedprocessor/queued_processor.go:199","msg":"Failed to process batch, re-enqueued","processor":"queued_retry","batch-size":3}
otel-agent_1 | 2020-04-06T22:34:48.560279000Z {"level":"warn","ts":1586212488.5359087,"caller":"queuedprocessor/queued_processor.go:205","msg":"Backing off before next attempt","processor":"queued_retry","backoff_delay":5}
otel-agent_1 | 2020-04-06T22:34:48.560323400Z {"level":"warn","ts":1586212488.5417295,"caller":"queuedprocessor/queued_processor.go:187","msg":"Sender failed","processor":"queued_retry","error":"Post \"http://zipkin-all-in-one:9411/api/v2/spans\": dial tcp 172.19.0.2:9411: connect: connection refused","spanFormat":"zipkin"}
[Screenshot attached: Screen Shot 2020-04-07 at 2 40 55 PM]
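The warnings above describe the queued_retry processor's retry path: a failed send is logged, the batch is re-enqueued, and the worker backs off before the next attempt. A stripped-down sketch of that pattern follows; it is illustrative only, with made-up names, not the actual queuedprocessor code.

package main

import (
	"errors"
	"log"
	"time"
)

type batch struct{ spans int }

// worker pulls batches off a bounded queue; on failure it re-enqueues the
// batch and backs off, mirroring the "Sender failed" / "re-enqueued" /
// "Backing off" warnings above.
func worker(queue chan batch, send func(batch) error, backoff time.Duration) {
	for b := range queue {
		if err := send(b); err != nil {
			log.Printf("Sender failed: %v", err)
			select {
			case queue <- b: // the batch stays in memory until the backend recovers
				log.Printf("Failed to process batch, re-enqueued, batch-size=%d", b.spans)
			default:
				log.Printf("queue full, dropping batch of %d spans", b.spans)
			}
			log.Printf("Backing off before next attempt, backoff_delay=%s", backoff)
			time.Sleep(backoff)
		}
	}
}

func main() {
	queue := make(chan batch, 5000) // cf. the 5000-item default discussed below
	alwaysFail := func(batch) error { return errors.New("connection refused") }
	go worker(queue, alwaysFail, 5*time.Second)
	queue <- batch{spans: 16}
	time.Sleep(12 * time.Second) // observe a few retry cycles, then exit
}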
james-bebbington commented 4 years ago

I did a bit of investigation on this, and as far as I can tell it is working as intended (no memory leak), although there are definitely things that could be improved.

With the Zipkin and Jaeger back ends configured correctly, I was able to push through thousands of traces per second with minimal memory usage. But if I didn't configure a Zipkin or Jaeger back end (or misconfigured the exporter), the unsent traces would back up and consume a lot of memory (though not unbounded).

The minimum needed to replicate this is to start the following containers (as in examples/demo):

demo_jaeger-emitter_1
demo_otel-agent_1

Configure the emitter above to push a large number of traces, with a pipeline like:

traces:
  receivers: [jaeger]
  exporters: [jaeger] # or zipkin
  processors: [queued_retry]

The queued_retry processor defaults to a limit of 5000 items in the queue (+10 in process), but "items" in this case are batches of traces received from Jaeger. The emitter used in the demo uses a 5s flush interval, and the Jaeger client defaults to a 1MB max packet size. So if you're pushing a decent number of reasonably large traces, you'll be sending through a lot of large packets (say 500/s * ~500 bytes per trace > 1MB per 5s flush), and you're looking at consuming 5000 * 1MB = 5GB of memory in total (in my tests actual memory usage was roughly 50% more than this, presumably because of extra overhead from the internal representation, etc.). This accumulates fairly slowly, maxing out after around 30 minutes (depending on how frequently you're pushing spans), but presumably consuming enough memory to cause OOM errors some time before that.
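As a quick sanity check of those numbers, here is a small calculation using the rates and sizes assumed above (assumptions, not measured values):

package main

import "fmt"

func main() {
	const (
		queueLimit     = 5000    // queued_retry default queue length (items = received batches)
		maxPacketBytes = 1 << 20 // Jaeger client default max packet size, ~1 MB
		tracesPerSec   = 500
		bytesPerTrace  = 500
		flushSeconds   = 5
	)
	// Data produced per flush interval: enough to fill a max-size packet.
	perFlush := tracesPerSec * bytesPerTrace * flushSeconds // 1,250,000 bytes > 1 MB
	// Worst case if every queued item is a full packet.
	worstCase := queueLimit * maxPacketBytes
	fmt.Printf("bytes per flush: %d\n", perFlush)
	fmt.Printf("worst-case queued payload: %d MiB (roughly the 5 GB figure above)\n", worstCase/(1<<20))
}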


Note we use Jaeger's bounded queue internally. They recently added a Resize() function to the bounded queue to support adjusting the queue size based on consumed memory: https://github.com/jaegertracing/jaeger/issues/943

That's something we might want to consider for the future. For now, it might be a good idea to reduce the default queue size somewhat (maybe 2000 is more reasonable, although I'm not sure how large a batch we should reasonably expect from the various receivers).
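For illustration, the kind of resizable bounded queue being referred to could look roughly like this. It is a sketch only, not Jaeger's actual implementation.

package main

import (
	"fmt"
	"sync"
)

// boundedQueue is a capacity-limited queue whose limit can be adjusted at
// runtime, in the spirit of the Resize() feature linked above.
type boundedQueue struct {
	mu       sync.Mutex
	capacity int
	items    []interface{}
}

// Produce adds an item unless the queue is at capacity; callers decide
// whether to drop or retry when false is returned.
func (q *boundedQueue) Produce(item interface{}) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) >= q.capacity {
		return false
	}
	q.items = append(q.items, item)
	return true
}

// Resize changes the capacity, e.g. in response to observed memory usage.
func (q *boundedQueue) Resize(newCapacity int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.capacity = newCapacity
}

func main() {
	q := &boundedQueue{capacity: 5000}
	fmt.Println(q.Produce("batch")) // true while under the limit
	q.Resize(2000)                  // e.g. shrink when memory pressure is high
}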


Not completely relevant, but it's also worth noting this regarding how items are dropped if the bounded queue reaches its limit: https://github.com/jaegertracing/jaeger/issues/1947

tigrannajaryan commented 4 years ago

Closing based on the comment from @james-bebbington. Feel free to reopen if there is evidence that a leak exists.