Spanmetrics: request rate metrics are jagged, 0 value every minute

grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.

https://grafana.com/oss/tempo/

GNU Affero General Public License v3.0


Spanmetrics: request rate metrics are jagged, 0 value every minute #2516

Closed: domasx2 closed this issue 1 year ago

domasx2 commented 1 year ago

Describe the bug

The rate() over spanmetrics is suspiciously jagged, dropping to 0 every minute.

explore in grafana ops

To Reproduce

Steps to reproduce the behavior:

1. Enable spanmetrics.
2. Send a good volume of traces.
3. Run a rate() query and note that every minute the spanmetrics time series produce a 0 or near-0 value.

Goutham: So basically the samples are sent every 15s, but the value of the metric is updated only every 1m.

For the "instant" query: traces_spanmetrics_latency_count{__metrics_gen_instance="metrics-generator-5684fd747f-7zgwt",cluster=.........}[10m], you get:

Here you can see that the metrics are updated only every 1m, but sent every 15s.
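To make the effect of that mismatch concrete, here is a minimal Python sketch, not Tempo or Prometheus code. It assumes a counter that grows at a steady rate but is refreshed only every 60s, sampled every 15s, and it uses a simplified per-sample increase as a stand-in for rate(). The function names and the 10 req/s figure are hypothetical; the real rate() works over a configurable window and extrapolates, so the exact shape on a dashboard will differ.

```python
def sample_counter(true_rate, update_every, scrape_every, duration):
    """Return (timestamp, value) samples of a counter that grows at
    true_rate per second but is only refreshed every update_every seconds."""
    samples = []
    for t in range(0, duration + 1, scrape_every):
        # The exposed value is frozen at the most recent refresh time.
        last_update = (t // update_every) * update_every
        samples.append((t, true_rate * last_update))
    return samples


def per_interval_rate(samples):
    """Per-second increase between consecutive samples (a simplified
    stand-in for rate() over a window holding exactly two samples)."""
    return [
        (v1 - v0) / (t1 - t0)
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
    ]


# Assumed numbers: 10 requests/s, counter refreshed every 60s, sampled every 15s.
samples = sample_counter(true_rate=10, update_every=60, scrape_every=15, duration=120)
rates = per_interval_rate(samples)
print(rates)  # [0.0, 0.0, 0.0, 40.0, 0.0, 0.0, 0.0, 40.0]
```

In this toy model, three of every four intervals see no change and the fourth absorbs a full minute of increase, so the computed rate alternates between 0 and a spike even though the underlying request rate is constant. The average of the series still equals the true rate of 10 per second.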

cyrille-leclerc commented 1 year ago

Is it correct to say that we don't reproduce this problem on the "Otel Demo" environment nor in our other testing environments? Could there be something special about this environment? Could the Grafana Ops environment be different from "standard environments"?

domasx2 commented 1 year ago

@cyrille-leclerc I think you are correct. The Ops Prometheus datasource has its scrape interval set to 15s, while "standard" cloud instances have 60s. That should explain it.

domasx2 commented 1 year ago

What I said above is wrong; it's not due to the scrape interval. In the dev environment, data point updates and scrapes are aligned. Could it be a different Tempo configuration?

mapno commented 1 year ago

After investigating, we concluded that the principal cause of the sawtooth-like metrics when applying rate() is the batching and buffering that happens between the moment a span is created and the moment it reaches the metrics-generator. As a result, spans arrive at the processors in small bursts rather than at the steady pace at which they are actually generated.
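The burst effect described above can be illustrated with a small Python sketch. It is a toy model, not the actual Tempo or collector pipeline: spans are produced at a constant rate, held in a buffer, and forwarded only when a flush timer fires. The names `spans_per_second` and `batch_timeout` are hypothetical and chosen only for this example.

```python
def arrivals_per_second(spans_per_second, batch_timeout, duration):
    """Count spans reaching the downstream processor in each second when a
    buffer in front of it flushes only every batch_timeout seconds."""
    arrivals = []
    buffered = 0
    for second in range(1, duration + 1):
        buffered += spans_per_second      # spans are generated steadily
        if second % batch_timeout == 0:   # flush timer fires
            arrivals.append(buffered)     # the whole batch arrives at once
            buffered = 0
        else:
            arrivals.append(0)            # nothing reaches the processor
    return arrivals


# Assumed numbers: 2 spans/s generated, buffer flushed every 5s.
arrivals = arrivals_per_second(spans_per_second=2, batch_timeout=5, duration=10)
print(arrivals)  # [0, 0, 0, 0, 10, 0, 0, 0, 0, 10]
```

The total number of spans is unchanged (20 over 10 seconds), but the processor sees them in two bursts of 10 instead of a steady 2 per second. Any counter derived from these arrivals grows in steps, which is what a short-window rate() then renders as a sawtooth.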