grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Continuous span errors while tracing tempo #3645

Open · madaraszg-tulip opened this issue 2 months ago

madaraszg-tulip commented 2 months ago

Describe the bug: We are tracing our entire monitoring stack, including Tempo itself. We also generate a service graph, which shows that a significant portion of tempo-distributor to tempo-ingester calls are errors; however, these are only "context cancelled" calls and don't appear to be actual errors.
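
For context on why these render red: a service-graph generator typically counts an edge as failed when either the client or the server span carries an error status. The Go sketch below is an illustrative assumption of that logic, not Alloy's actual implementation; `Edge` and `record` are hypothetical names.

```go
package main

import "fmt"

// Edge aggregates service-graph metrics for one client -> server pair.
type Edge struct {
	Client, Server string
	Total, Failed  int
}

// record counts one matched client/server span pair. Either side
// carrying an error status marks the request as failed, which is what
// renders as the red section of the service-graph node.
func (e *Edge) record(clientErr, serverErr bool) {
	e.Total++
	if clientErr || serverErr {
		e.Failed++
	}
}

func main() {
	e := Edge{Client: "tempo-distributor", Server: "tempo-ingester"}
	e.record(false, false) // successful PushBytesV2
	e.record(false, false) // successful PushBytesV2
	e.record(true, false)  // cancelled third call, error status on the client span
	fmt.Printf("%s -> %s: %d/%d failed\n", e.Client, e.Server, e.Failed, e.Total)
}
```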

To Reproduce: Steps to reproduce the behavior:

  1. Configure tempo 2.4.1 to trace itself (we do this through alloy, which also does tail sampling)
  2. Configure service graph generation (again, we are doing this in alloy)
  3. See the red section in the tempo-ingester service graph node.

Expected behavior: I would not expect to see continuous errors from our Tempo installation.

Environment:

Additional Context

[screenshot]

[screenshot]

Basically every trace shows the distributor doing PushBytesV2 against 3 ingesters, and when 2 ingesters respond, the third call is cancelled on the distributor. Either this is the intended behavior, in which case the cancelled call should not be marked as an error on the span, or it is an actual issue that needs to be fixed.
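
To illustrate the pattern: a minimal Go sketch of a quorum write, assuming a hypothetical `push` helper rather than Tempo's real distributor code. Once two of three replica calls succeed, the shared context is cancelled, and the still-in-flight third call returns `context.Canceled`, which instrumented clients record as a span error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// push simulates an RPC to one ingester replica; a real call would be
// instrumented, so a non-nil error marks the client span as failed.
func push(ctx context.Context, replica int) error {
	select {
	case <-time.After(time.Duration(replica) * 10 * time.Millisecond):
		return nil // replica responded
	case <-ctx.Done():
		return ctx.Err() // context.Canceled shows up as a span error
	}
}

// quorumPush returns success as soon as 2 of 3 replicas ack, then
// cancels the shared context, aborting the slowest in-flight call.
func quorumPush(ctx context.Context) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels any remaining call once we return

	results := make(chan error, 3)
	for i := 1; i <= 3; i++ {
		go func(replica int) { results <- push(ctx, replica) }(i)
	}

	var acks, fails int
	for i := 0; i < 3; i++ {
		err := <-results
		switch {
		case err == nil:
			acks++
		case errors.Is(err, context.Canceled):
			// expected: quorum was already reached
		default:
			fails++
		}
		if acks >= 2 {
			return nil // quorum reached; deferred cancel aborts the third call
		}
		if fails >= 2 {
			return fmt.Errorf("quorum failed")
		}
	}
	return fmt.Errorf("quorum failed")
}

func main() {
	fmt.Println(quorumPush(context.Background()))
}
```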

We are doing tail sampling of traces, primarily percentage based, but also forwarding all traces that contain errors. This means that practically all traces from the distributor are sampled, because they all contain errors.
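
A minimal sketch of why that feedback loop happens, assuming an error policy ORed with a probabilistic policy (hypothetical names, not Alloy's actual implementation): one error-status span anywhere in the trace forces the trace to be kept, regardless of the sampling percentage.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Span is a minimal stand-in for a span in a buffered trace.
type Span struct{ StatusError bool }

// sampleTrace mimics the OR of two tail-sampling policies:
// keep every trace that contains an error span, plus a percentage of the rest.
func sampleTrace(spans []Span, percentage float64) bool {
	for _, s := range spans {
		if s.StatusError {
			return true // error policy: always sampled
		}
	}
	return rand.Float64()*100 < percentage // probabilistic policy
}

func main() {
	// A trace whose third PushBytesV2 call was cancelled and marked as
	// an error is always sampled, regardless of the percentage.
	trace := []Span{{false}, {false}, {StatusError: true}}
	fmt.Println(sampleTrace(trace, 10)) // true
}
```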

joe-elliott commented 1 month ago

Tempo does return success as soon as two of three writes to ingesters succeed, but it shouldn't be cancelling the third. It would be interesting to review metrics to see why this might be occurring.
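
If the cancellation does turn out to be intended quorum behavior, one direction on the instrumentation side would be to record cancellations with a non-error span status. A hedged Go sketch using the real `google.golang.org/grpc/status` and `go.opentelemetry.io/otel/codes` packages; `spanStatusFor` is a hypothetical helper, not Tempo's actual code.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"go.opentelemetry.io/otel/codes"
	grpccodes "google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// spanStatusFor decides what status to record on the client span.
// Cancellations after quorum are expected, so they are not errors.
func spanStatusFor(err error) (codes.Code, string) {
	switch {
	case err == nil:
		return codes.Ok, ""
	case errors.Is(err, context.Canceled),
		status.Code(err) == grpccodes.Canceled:
		// Expected when the distributor already got 2 of 3 acks.
		return codes.Unset, "cancelled after quorum"
	default:
		return codes.Error, err.Error()
	}
}

func main() {
	c, msg := spanStatusFor(context.Canceled)
	fmt.Println(c, msg) // Unset cancelled after quorum
}
```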

madaraszg-tulip commented 1 month ago

All ingesters are healthy. This happens in all three environments that we have (prod, staging, testing). Latency is uniform and stable across all ingesters in all environments; the dashboard shows a median of 2.5 ms and a 99th percentile of 4.95 ms. They run in AWS EKS, and the cluster is healthy.

madaraszg-tulip commented 1 month ago

Some additional information, focusing on our testing instance now, as it is the smallest and has the lowest load. All the tempo pods run on a single node dedicated to this Tempo instance, which has more than enough CPU and memory (c6g.large: 2 cores, 4 GB). We run one distributor and three ingesters. Sustained load on the distributor is 50 spans/second.

[screenshot]

Span error rate on the distributor is 0.7/sec

[screenshot: distributor span error rate]

Forwarder pushes are about 0.95/sec

[screenshot: forwarder push rate]