Open madaraszg-tulip opened 2 months ago
Tempo does return success as soon as two of the three writes to ingesters succeed, but it shouldn't be cancelling the third. It would be worth reviewing metrics to see why this is occurring.
All ingesters are healthy, and this happens in all three environments we run (prod, staging, testing). Latency is uniform and stable across all ingesters in all environments: the dashboard shows 2.5 ms at the median and 4.95 ms at the 99th percentile. Everything runs in AWS EKS, and the cluster is healthy.
Some additional information, focusing on our testing instance, as it is the smallest and has the lowest load. All the Tempo pods run on a single node dedicated to this instance, with more than enough CPU and memory (c6g.large: 2 cores, 4 GB). We run one distributor and three ingesters; sustained load on the distributor is 50 spans/second.
Span error rate on the distributor is 0.7/sec, and forwarder pushes are about 0.95/sec.
Describe the bug
We are tracing our entire monitoring stack, including Tempo itself. We also generate a service graph, which shows that a significant portion of tempo-distributor to tempo-ingester calls are errors; however, these are all "context cancelled" calls and don't appear to be actual errors.
Expected behavior
I would not expect to see continuous errors reported by our Tempo installation.
Additional Context
Basically every trace shows the distributor doing PushBytesV2 against 3 ingesters, and as soon as 2 ingesters respond, the third call is cancelled by the distributor. Either this is intended behavior, in which case the cancelled call should not be marked as an error on the span, or it is an actual issue and needs to be fixed.
We do tail sampling of traces, primarily percentage-based, but we also forward all traces that contain errors. In practice this means nearly every trace from the distributor is sampled, because they all contain these errors.
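To illustrate why this interacts badly with our sampling: a policy set like the following (a sketch in the style of the OpenTelemetry Collector's tail_sampling processor; the policy names, percentage, and decision_wait are illustrative, not our exact config) keeps any trace containing an ERROR-status span, so a guaranteed cancelled-call error per push defeats the percentage policy entirely:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Keep the whole trace if any span has status ERROR --
      # which, here, every distributor push trace does.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Otherwise sample a fixed percentage of traces.
      - name: percentage
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```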