grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/

Streaming gRPC does not avoid 'message larger than max' error #3765

Open akevdmeer opened 3 weeks ago

akevdmeer commented 3 weeks ago

When querying Tempo with (streaming) gRPC using tempo-cli --use-grpc to do a large query, tempo-cli fails:

tempo-cli: error: main.metricsQueryRangeCmd.Run(): rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4560241 vs. 4194304)

... and the same failure occurs when tempo-cli is given the same max message size that we had previously configured on the server:

tempo-cli: error: main.metricsQueryRangeCmd.Run(): rpc error: code = Internal desc = response larger than the max (16842812 vs 16777216)
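
For context, an observation about gRPC-Go defaults rather than anything taken from Tempo's code: the 4194304 in the first error matches gRPC-Go's default client-side receive limit of 4 MiB, while the sending side enforces its own, separately configured limit, which would explain why raising the client-side value only moved the ceiling to 16 MiB (16777216 bytes). A minimal standalone sketch of the two knobs involved, not tempo-cli's or Tempo's actual wiring:

```go
// Minimal sketch, not tempo-cli's or Tempo's actual code: the two limits
// involved when a streamed gRPC response exceeds a message-size ceiling in Go.
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Client side: gRPC-Go's default receive limit is 4 MiB (4194304 bytes),
	// which matches the first error. It can be raised per connection.
	conn, err := grpc.Dial(
		"localhost:3100",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(64*1024*1024)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Server side: the sender enforces its own limits independently, which is
	// presumably why the error only moved to a 16 MiB ceiling after raising
	// the client-side value.
	_ = grpc.NewServer(
		grpc.MaxRecvMsgSize(64*1024*1024),
		grpc.MaxSendMsgSize(64*1024*1024),
	)
}
```

Raising both limits only buys headroom, of course; a sufficiently large query will still hit whichever ceiling is configured, which is what the rest of this thread is about.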

To Reproduce: Perform a query with tempo-cli that returns a lot of data. Specifically, I tested with:

tempo-cli query api metrics localhost:3100 '{ selector } | quantile_over_time(duration, .99) by (span.http.url)' starttime endtime --use-grpc

Expected behavior: The streaming gRPC endpoint should stay under the max message size by chunking responses appropriately, allowing large queries to run.

Environment: Tempo 2.5.0 on Kubernetes 1.28

joe-elliott commented 3 weeks ago

While the other streaming endpoints send only a diff in each message, the metrics streaming endpoint currently sends all accumulated data in every message.

We should update the metrics streaming endpoint to send only the diff by improving this method:

https://github.com/grafana/tempo/blob/main/modules/frontend/combiner/metrics_query_range.go#L63
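
To illustrate the pattern being described, here is a rough sketch of a diff-style combiner using hypothetical types, not the actual code in metrics_query_range.go: keep the full accumulated result, but remember how much of each series has already been streamed, so that each GRPCCollector update carries only what is new.

```go
// Illustrative sketch of the "send only the diff" pattern, using hypothetical
// types rather than Tempo's actual combiner.
package combiner

// series is a stand-in for one time series in a query-range response.
type series struct {
	labels  string
	samples []float64
}

// diffCombiner accumulates results and remembers how much of each series has
// already been streamed to the client, so each update carries only new data.
type diffCombiner struct {
	all  map[string]*series // complete accumulated result
	sent map[string]int     // number of samples already sent per series
}

func newDiffCombiner() *diffCombiner {
	return &diffCombiner{
		all:  map[string]*series{},
		sent: map[string]int{},
	}
}

// add merges a partial result from one job into the accumulated state.
func (c *diffCombiner) add(s series) {
	cur, ok := c.all[s.labels]
	if !ok {
		cur = &series{labels: s.labels}
		c.all[s.labels] = cur
	}
	cur.samples = append(cur.samples, s.samples...)
}

// diff returns only what has not been streamed yet and marks it as sent.
// The behavior described in this issue is the equivalent of returning c.all
// on every call, which grows without bound for large queries.
func (c *diffCombiner) diff() []series {
	var out []series
	for key, s := range c.all {
		from := c.sent[key]
		if from >= len(s.samples) {
			continue // nothing new for this series
		}
		out = append(out, series{labels: key, samples: s.samples[from:]})
		c.sent[key] = len(s.samples)
	}
	return out
}
```

With this shape, the size of each streamed message is bounded by how much new data arrived since the previous update rather than by the total size of the result so far.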

akevdmeer commented 3 weeks ago

I see, that provokes the error easily!

But even with a proper diff(), isn't it still possible to exceed the connection's max message size when more fresh data arrives between two GRPCCollector updates than fits in a single message?

joe-elliott commented 3 weeks ago

> But even with a proper diff(), isn't it still possible to exceed the connection's max message size when more fresh data arrives between two GRPCCollector updates than fits in a single message?

This is correct. We could try to do something that cuts diffs aggressively for larger datasets. It feels like it would be quite difficult to do well and would have some difficult-to-handle edge cases. I'm open to hearing a proposal, but this is currently a lower priority for us.
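
To make the idea of cutting diffs concrete, one possible, purely hypothetical shape for it is sketched below, reusing the series stand-in from the earlier sketch: split each diff into several messages, each kept under an approximate byte budget. The parts that make this genuinely hard (accurately estimating the encoded proto size, a single series that is itself over budget, ordering and completeness guarantees) are exactly what the sketch glosses over.

```go
// Purely hypothetical sketch of size-capped diff chunking; not a proposal for
// Tempo's actual implementation.
package combiner

// series is the same hypothetical stand-in used in the earlier sketch.
type series struct {
	labels  string
	samples []float64
}

// chunkDiff splits one diff into several messages, each kept under an
// approximate byte budget, so a burst of fresh data between two collector
// updates is streamed as several gRPC messages instead of one oversized one.
func chunkDiff(diff []series, maxBytes int) [][]series {
	const bytesPerSample = 16 // crude assumption: timestamp + value

	var (
		chunks  [][]series
		current []series
		size    int
	)
	for _, s := range diff {
		estimate := len(s.labels) + len(s.samples)*bytesPerSample
		if len(current) > 0 && size+estimate > maxBytes {
			chunks = append(chunks, current)
			current, size = nil, 0
		}
		current = append(current, s)
		size += estimate
	}
	if len(current) > 0 {
		chunks = append(chunks, current)
	}
	return chunks
}
```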

akevdmeer commented 3 weeks ago

Clear, thank you. A proper diff() for the metrics queries plus a larger gRPC max message size may solve the issue, practically speaking.