
batchprocessor: send_batch_max_size_bytes limit #6046

Open jangaraj opened 1 year ago

jangaraj commented 1 year ago

Is your feature request related to a problem? Please describe.
The Go gRPC server has a default message size limit of 4 MB. The batch processor can generate a bigger message, so the receiver will reject the batch and the whole batch can be dropped:

"msg": "Exporting failed. The error is not retryable. Dropping data.",
"kind": "exporter",
"data_type": "traces",
"name": "otlp",
"error": "Permanent error: rpc error: code = ResourceExhausted desc = grpc: received message after decompression larger than max (5297928 vs. 4194304)",
"dropped_items": 4725,

The current batch processor config options don't provide a way to prevent this situation, because they work only with span counts, not with the overall batch size. send_batch_max_size is also a count of spans.

Describe the solution you'd like
A new config option send_batch_max_size_bytes (maybe there is a better name), defaulting to the gRPC 4 MB limit (4194304), which would ensure that a batch never exceeds this size.
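A rough sketch of how the proposed option could sit next to the existing batch processor settings; send_batch_max_size_bytes is the hypothetical new option and does not exist today:

processors:
  batch:
    timeout: 200ms
    send_batch_size: 8192              # existing option: batch size trigger, counted in spans/data points/log records
    send_batch_max_size: 10000         # existing option: upper bound, also a count
    send_batch_max_size_bytes: 4194304 # proposed (hypothetical): cap the encoded batch at the 4 MB gRPC default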

Describe alternatives you've considered
At the moment the user can tune send_batch_size/send_batch_max_size, but in theory there can be a few traces with huge spans (e.g. Java backtraces with logs) and the default 4 MB gRPC message limit can still be exceeded. Maybe the OTLP exporter could handle this message limitation.

evandam commented 1 year ago

:+1: for this feature. We're currently doing some trial and error to figure out the right balance of send_batch_size and send_batch_max_size and hoping it stays under 4 MB, but having a guarantee would definitely be preferred.

jangaraj commented 1 year ago

@evandam I made some recommendations here https://github.com/monitoringartist/opentelemetry-trace-pipeline-poisoning#mitigation-of-huge-4mb-trace

evandam commented 1 year ago

Nice link, thank you! It definitely still relies on some back-of-the-envelope math which is bound to be wrong sooner or later, and it would be great to have an easy way to do this at the exporter/collector level.
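For illustration, the back-of-the-envelope approach with only the existing span-count options looks roughly like this; the average span size is an assumption and varies per workload:

processors:
  batch:
    # Assuming an average encoded span of ~2 KiB: 4 MiB / 2 KiB ≈ 2048 spans,
    # so stay well below that. A single huge span can still blow the estimate.
    send_batch_size: 1000
    send_batch_max_size: 1500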

dmitryax commented 1 year ago

Size-based batching will only work if the processor is used with the OTLP exporter; other exporters will produce different batch sizes due to different encodings. I believe if we go with https://github.com/open-telemetry/opentelemetry-collector/issues/4646, we should be able to provide this for any exporter.

cwegener commented 10 months ago

> Describe alternatives you've considered
> At the moment the user can tune send_batch_size/send_batch_max_size, but in theory there can be a few traces with huge spans (e.g. Java backtraces with logs) and the default 4 MB gRPC message limit can still be exceeded. Maybe the OTLP exporter could handle this message limitation.

For those willing to configure a different amount of memory to be allocated for each gRPC message in the downstream OTLP collector's receiver config, there is also the max_recv_msg_size_mib option.

https://github.com/open-telemetry/opentelemetry-collector/issues/1122#issuecomment-1765478663
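For example, on the downstream collector the OTLP receiver's gRPC server limit can be raised (the 16 MiB value is just an illustrative choice):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16   # raise the server-side limit above the 4 MiB default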

elysiumHL commented 10 months ago

> Describe alternatives you've considered
> At the moment the user can tune send_batch_size/send_batch_max_size, but in theory there can be a few traces with huge spans (e.g. Java backtraces with logs) and the default 4 MB gRPC message limit can still be exceeded. Maybe the OTLP exporter could handle this message limitation.
>
> For those willing to configure a different amount of memory to be allocated for each gRPC message in the downstream OTLP collector's receiver config, there is also the max_recv_msg_size_mib option.
>
> #1122 (comment)

this param did not work for otlp exporter

cwegener commented 10 months ago

> this param did not work for otlp exporter

No, it won't. The maximum receive message size only applies on the gRPC server side.

On the gRPC client side, the client's max receive message size must be provided in the call options when the client makes a call to the gRPC server.

What is your OTel Collector use case where the exporter receives such large messages from the remote OTLP receiver, though? I cannot think of a scenario where this would even be the case.

lmnogues commented 2 months ago

Did you manage to solve this issue?

ptodev commented 2 months ago

I think this is the issue which would resolve this eventually.

smoke commented 1 month ago

Is there a way to dump / debug spans causing that?

Update: I figured it out by configuring the OTel Collector this way, so it prints both the error message and all of the span details it sends:

...
config:
  exporters:
    otlp:
      endpoint: "otel-lb-collector:4317"
      tls:
        insecure: true
    debug: {}
    debug/detailed:
      verbosity: detailed
  extensions:
    health_check: {}
  processors:
    resourcedetection:
      detectors: [env, system]
    batch:
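      # send_batch_size/send_batch_max_size of 1 puts each span in its own batch,
      # so the oversized span shows up right next to the export error in the debug output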
      send_batch_size: 1
      send_batch_max_size: 1
  ...
  service:
  ...
    pipelines:
    ...
      traces:
        exporters:
          - debug
          - debug/detailed
          - otlp
    ...

In my case the culprit was the Python Pymongo instrumentation with capture_statement enabled, so all of the content of an insert statement was captured. It was sent to the otel-agent through OTLP/HTTP fine, and then the error happens when the otel-agent sends through OTLP/gRPC to the otel-gw.