grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

tail_sampling fails on traces larger than 4Mb #481

Open serhij-matvejev opened 1 year ago

serhij-matvejev commented 1 year ago

Hi,

We have an issue with tail_sampling when a trace is larger than 4 MB. The grafana-agent supports traces larger than the default 4 MB limit: the limit can be changed via the agent arguments (-server.grpc.max-recv-msg-size-bytes and -server.grpc.max-send-msg-size-bytes), and this works well when you simply send traces app -> agent -> tempo.

But once you enable tail_sampling and send any trace larger than the default 4 MB, it fails. The error message looks like this:

ts=2023-03-01T09:44:56.763536564Z caller=zapadapter.go:84 level=error component=traces traces_config=default kind=exporter data_type=traces name=otlp/0 msg="Exporting failed. The error is not retryable. Dropping data." error="Permanent error: rpc error: code = ResourceExhausted desc = grpc: received message after decompression larger than max (4194693 vs. 4194304)" dropped_items=1

So the agent drops the trace, and it never reaches Tempo.
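To make the numbers in that error concrete, here is a minimal sketch that pulls the two sizes out of the error description and compares them. The helper name `parse_overflow` is made up for illustration; the only input is the log text quoted above.

```python
import re

# Description text copied from the error message above.
LOG_DESC = ("grpc: received message after decompression larger than max "
            "(4194693 vs. 4194304)")


def parse_overflow(desc):
    """Return (received_bytes, limit_bytes) parsed from the error description.

    Hypothetical helper: it only matches the "(N vs. M)" pattern seen in
    this particular log line.
    """
    match = re.search(r"\((\d+) vs\. (\d+)\)", desc)
    if match is None:
        raise ValueError("no size pair found in description")
    return int(match.group(1)), int(match.group(2))


received, limit = parse_overflow(LOG_DESC)
overflow = received - limit

print(f"limit    = {limit} bytes ({limit // (1024 * 1024)} MiB)")
print(f"received = {received} bytes")
print(f"overflow = {overflow} bytes")
```

The limit in the message is exactly 4 * 1024 * 1024 bytes, and the rejected message is only a few hundred bytes over it, so the exporter is running with the default limit rather than the values passed on the command line.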

With smaller traces, tail_sampling works fine.

We have tested this on agent versions v0.28.1, v0.31.3 and v0.32.0, all in static mode.

It is probably possible to adjust the code so that it uses the configured values instead of the default (taking them from MaxRecvMsgSize and MaxSendMsgSize, if possible).

yashumitsu commented 1 year ago

Hello, we encountered a similar error, although in our case the limit was related to the batch processor. After reducing the batch sizes, we no longer saw dropped spans, even though we are also using tail_sampling.
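A rough sketch of why smaller batches can avoid the error, assuming batches are capped by span count and each batch is exported as one message. The span sizes and batch caps below are invented for illustration, and the helper `batch_sizes` is not part of the agent or the batch processor; protobuf and compression overhead are ignored.

```python
# Default gRPC message limit seen in the error message (4 MiB).
GRPC_LIMIT = 4 * 1024 * 1024


def batch_sizes(span_sizes, max_spans_per_batch):
    """Group spans in order into batches of at most max_spans_per_batch
    spans and return the total payload size of each batch in bytes."""
    return [
        sum(span_sizes[i:i + max_spans_per_batch])
        for i in range(0, len(span_sizes), max_spans_per_batch)
    ]


# Hypothetical workload: 10,000 spans of 1 KiB each.
spans = [1024] * 10_000

for max_spans in (8192, 2048):
    sizes = batch_sizes(spans, max_spans)
    too_big = [s for s in sizes if s > GRPC_LIMIT]
    print(f"cap={max_spans}: {len(sizes)} batches, "
          f"largest={max(sizes)} bytes, over limit={len(too_big)}")
```

With the larger cap, the first batch alone is twice the limit and would be rejected; with the smaller cap, every batch fits. Real span sizes vary, so a safe cap depends on the workload.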

tpaschalis commented 1 year ago

We should amend our production recommendation docs to mention the batch processor.

clayton-cornell commented 1 year ago

@tpaschalis I'll need some dev team guidance here to know what to add to the docs. The two primary places where we document tail_sampling are: