grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

metrics-generator excessive err-mimir-sample-duplicate-timestamp #4201


msvechla commented 1 month ago

Describe the bug

Hello,

We are using the Tempo metrics-generator to generate span metrics from traces.

In general this works; however, our metrics-generator is throwing lots of err-mimir-sample-duplicate-timestamp errors in the logs.

The error is thrown on average about 250 times per minute (see the attached screenshot).

Some sample log lines:

ts=2024-10-17T10:42:13.405615049Z caller=dedupe.go:112 tenant=single-tenant component=remote level=error remote_name=45e95a url=http://mimir-gateway.default.svc.cluster.local/api/v1/push msg="non-recoverable error" failedSampleCount=2000 failedHistogramCount=0 failedExemplarCount=0 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester mimir-ingester-2: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-10-17T10:41:57.946Z and is from series traces_spanmetrics_latency_bucket{__metrics_gen_instance=\"tempo-metrics-generator-dcf9fd768-z29fq\", http_request_method=\"POST\", http_response_status_code=\"200\", le=\"1.02\", platform_aws_organizational_unit=\"fleetassets\", platform_aws_stage=\"prod\", service=\"vit-image-detect-labels\", span_kind=\"SPAN_KIND_CLIENT\", span_name=\"POST\", status_code=\"STATUS_CODE_UNSET\"} (sampled 1/10)\n"
ts=2024-10-17T10:42:11.722394528Z caller=dedupe.go:112 tenant=single-tenant component=remote level=error remote_name=45e95a url=http://mimir-gateway.default.svc.cluster.local/api/v1/push msg="non-recoverable error" failedSampleCount=2000 failedHistogramCount=0 failedExemplarCount=0 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester mimir-ingester-2: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-10-17T10:41:57.946Z and is from series traces_spanmetrics_latency_bucket{__metrics_gen_instance=\"tempo-metrics-generator-dcf9fd768-z29fq\", http_request_method=\"GET\", http_response_status_code=\"200\", le=\"1.02\", platform_aws_organizational_unit=\"fleetops\", platform_aws_stage=\"prod\", service=\"fmm-graphql-gateway\", span_kind=\"SPAN_KIND_CLIENT\", span_name=\"GET\", status_code=\"STATUS_CODE_UNSET\"} (sampled 1/10)\n"

In our infrastructure this seems to be coming mostly from metrics generated by auto-instrumented Node.js services, though other services might be affected as well.

Expected behavior

err-mimir-sample-duplicate-timestamp should not be thrown regularly

Environment:

Additional Context

metrics-generator config:

    metrics_generator:
      metrics_ingestion_time_range_slack: 30s
      processor:
        service_graphs:
          dimensions:
          - platform.aws.organizational_unit
          - platform.aws.stage
          histogram_buckets:
          - 0.1
          - 0.2
          - 0.4
          - 0.8
          - 1.6
          - 3.2
          - 6.4
          - 12.8
          max_items: 10000
          wait: 10s
          workers: 10
        span_metrics:
          dimension_mappings:
          - name: http_request_method
            source_labels:
            - http.method
          - name: http_response_status_code
            source_labels:
            - http.status_code
          dimensions:
          - platform.aws.organizational_unit
          - platform.aws.stage
          - messaging.operation
          - http.request.method
          - http.response.status_code
          - http.route
          - http.method
          - http.status_code
          histogram_buckets:
          - 0.002
          - 0.004
          - 0.008
          - 0.016
          - 0.032
          - 0.064
          - 0.128
          - 0.256
          - 0.512
          - 1.02
          - 2.05
          - 4.1
      registry:
        collection_interval: 15s
        external_labels: {}
        stale_duration: 15m
      ring:
        kvstore:
          store: memberlist
      storage:
        path: /var/tempo/wal
        remote_write:
        - send_exemplars: true
          url: ...
          write_relabel_configs:
          - regex: ^(.+)$
            source_labels:
            - http_method
            target_label: http_request_method
          - regex: ^(.+)$
            source_labels:
            - http_status_code
            target_label: http_response_status_code
          - action: labeldrop
            regex: ^http_method|http_status_code$

Thanks a lot for your help!

yvrhdn commented 1 month ago

Hi, I agree this shouldn't be happening. We add a unique label __metrics_gen_instance to the series generated by each metrics-generator to ensure the series are unique between them. Since you are running on Kubernetes, the hostnames for each pod should be unique.
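
For illustration only, a tiny sketch of that idea (not the actual Tempo code; the helper name is made up): each generator instance stamps every exported series with an instance-unique label, here derived from the pod hostname.

    // Hypothetical sketch: add the per-instance label to a series' label set.
    package main

    import (
        "fmt"
        "os"
    )

    func withInstanceLabel(series map[string]string) map[string]string {
        // On Kubernetes the hostname is the pod name, e.g.
        // tempo-metrics-generator-dcf9fd768-z29fq, and is unique per replica.
        hostname, _ := os.Hostname()
        out := map[string]string{"__metrics_gen_instance": hostname}
        for k, v := range series {
            out[k] = v
        }
        return out
    }

    func main() {
        fmt.Println(withInstanceLabel(map[string]string{
            "service":   "fmm-graphql-gateway",
            "span_name": "GET",
        }))
    }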

Can you check some things?

  • Does this only happen to traces_spanmetrics_latency_bucket or to all metrics types? (so also counters)
  • Do the log lines have mostly the same attributes or are they very distinct? As in, is it just one series being repeated over and over or do you see a mix of series?
  • Do you happen to have out-of-order sample ingestion enabled in Mimir? We shouldn't need it, but it might be interesting to see if this changes things.

Thanks!


Our current theory is that this might be due to how we are injecting a 0 sample when adding new histogram series. Whenever we remote write a series for the first time, we inject a 0 sample right before appending the actual value, to clearly mark it as going from 0 to the new value:

https://github.com/grafana/tempo/blob/1652db28469759696558bbf153f3f19475ed468f/modules/generator/registry/histogram.go#L190-L199

This code adds a sample with value 0, 1ms before the normal sample.

We have similar code in counter. But instead of injecting a sample 1ms earlier, we delay the next samples by 1s:

https://github.com/grafana/tempo/blob/1652db28469759696558bbf153f3f19475ed468f/modules/generator/registry/counter.go#L157-L169

So if you only see histograms causing the duplicate sample errors, that is a clear indicator something in that implementation is not right.
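
For illustration, a minimal sketch of the two patterns described above (the actual implementations are in the linked histogram.go and counter.go; the function names here are made up):

    package main

    import "fmt"

    type sample struct {
        timestampMs int64
        value       float64
    }

    // histogramSamples sketches the histogram behaviour: a brand-new series gets
    // a 0 sample backdated by 1ms, immediately followed by the real value.
    func histogramSamples(newSeries bool, tsMs int64, value float64) []sample {
        var out []sample
        if newSeries {
            out = append(out, sample{tsMs - 1, 0})
        }
        return append(out, sample{tsMs, value})
    }

    // counterSamples sketches the counter behaviour: the 0 sample is written at
    // the collection timestamp and the real value is delayed by 1s instead.
    func counterSamples(newSeries bool, tsMs int64, value float64) []sample {
        var out []sample
        if newSeries {
            out = append(out, sample{tsMs, 0})
            tsMs += 1000
        }
        return append(out, sample{tsMs, value})
    }

    func main() {
        fmt.Println(histogramSamples(true, 1_700_000_000_000, 3)) // [{1699999999999 0} {1700000000000 3}]
        fmt.Println(counterSamples(true, 1_700_000_000_000, 7))   // [{1700000000000 0} {1700000001000 7}]
    }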

msvechla commented 1 month ago

Hi and thanks for your detailed analysis!


Does this only happen to traces_spanmetrics_latency_bucket or to all metrics types? (so also counters)

  • It appears to be happening with all kinds of series; however, the largest number of errors is coming from the histogram metrics

Do the log lines have mostly the same attributes or are they very distinct? As in, is it just one series being repeated over and over or do you see a mix of series?

  • Most errors have very similar attributes; all of them have a very generic span_name like POST or GET

Do you happen to have out-of-order sample ingestion enabled in Mimir? We shouldn't need it, but it might be interesting to see if this changes things.

  • I will give this a try and report back

msvechla commented 1 month ago

@kvrhdn I also tried enabling out-of-order sample ingestion, but this had no effect at all. We are still getting the same amount of err-mimir-sample-duplicate-timestamp errors.

yvrhdn commented 1 month ago

Do you maybe have any relabel configs on your remote write? Maybe you are dropping labels that would make series unique.

Or do you maybe have multiple sources of these metrics? (e.g. you are sending from both Tempo and Alloy)
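
To make the first question concrete, here is a purely hypothetical sketch that mimics the write_relabel_configs from this issue with plain regexp (this is not the Prometheus relabel implementation). It assumes one series was produced from the http.method span attribute and another from http.request.method; after relabeling, the two end up with identical labels, so they would be remote written as two samples for the same series and timestamp, which is exactly what err-mimir-sample-duplicate-timestamp complains about:

    package main

    import (
        "fmt"
        "regexp"
    )

    type labelSet map[string]string

    var nonEmpty = regexp.MustCompile(`^(.+)$`)

    // relabel applies the three rules from the issue's write_relabel_configs.
    func relabel(in labelSet) labelSet {
        out := labelSet{}
        for k, v := range in {
            out[k] = v
        }
        // regex: ^(.+)$, source_labels: [http_method], target_label: http_request_method
        if v, ok := out["http_method"]; ok && nonEmpty.MatchString(v) {
            out["http_request_method"] = v
        }
        // regex: ^(.+)$, source_labels: [http_status_code], target_label: http_response_status_code
        if v, ok := out["http_status_code"]; ok && nonEmpty.MatchString(v) {
            out["http_response_status_code"] = v
        }
        // action: labeldrop, regex: ^http_method|http_status_code$
        delete(out, "http_method")
        delete(out, "http_status_code")
        return out
    }

    func main() {
        // Hypothetical series that differ only in which HTTP attribute the span carried.
        fromHTTPMethod := labelSet{"service": "svc-a", "span_name": "GET", "http_method": "GET", "http_status_code": "200"}
        fromHTTPRequestMethod := labelSet{"service": "svc-a", "span_name": "GET", "http_request_method": "GET", "http_response_status_code": "200"}

        // Both print the same label set after relabeling.
        fmt.Println(relabel(fromHTTPMethod))
        fmt.Println(relabel(fromHTTPRequestMethod))
    }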

Our current theory is that this might be due to how we are injecting a 0 sample when adding new histogram series.

We did some more investigation into this and it shouldn't be the issue. The errors we were seeing elsewhere were related to aggregation.

msvechla commented 1 month ago

Do you maybe have any relabel configs on your remote write? Maybe you are dropping labels that would make series unique.

The only remote-write config we have is the one I posted in our config above. To my knowledge this should just rename the labels and not cause any loss of uniqueness:

storage:
        path: /var/tempo/wal
        remote_write:
        - send_exemplars: true
          url: ...
          write_relabel_configs:
          - regex: ^(.+)$
            source_labels:
            - http_method
            target_label: http_request_method
          - regex: ^(.+)$
            source_labels:
            - http_status_code
            target_label: http_response_status_code
          - action: labeldrop
            regex: ^http_method|http_status_code$

msvechla commented 1 month ago

@kvrhdn it looks like removing the above-mentioned write_relabel_configs fixes the issue.

Any idea what is causing issues with these relabel configs? As I mentioned, I just want to rename http_method to http_request_method and http_status_code to http_response_status_code.

The ingested metrics look exactly like we would expect:

With relabel config:

traces_spanmetrics_calls_total{__metrics_gen_instance="tempo-metrics-generator-85ccf9bcdc-lw8sh",http_request_method="GET",http_response_status_code="200",platform_aws_organizational_unit="fleetops",platform_aws_stage="prod",service="fmm-graphql-gateway",span_kind="SPAN_KIND_CLIENT",span_name="GET",status_code="STATUS_CODE_UNSET"}

Without relabel config:

traces_spanmetrics_calls_total{__metrics_gen_instance="tempo-metrics-generator-85ccf9bcdc-lw8sh",http_method="GET",http_status_code="200",platform_aws_organizational_unit="fleetops",platform_aws_stage="prod",service="fmm-graphql-gateway",span_kind="SPAN_KIND_CLIENT",span_name="GET",status_code="STATUS_CODE_UNSET"}

yvrhdn commented 1 month ago

Nice find! Yeah, I'm not sure why this relabel config is causing issues (but I'm also not an expert on this). Maybe ^(.+)$ is too restrictive? You could try (.*) instead (the default) and see if that changes things.

The metrics-generator also has built-in support for remapping dimensions: it's the dimension_mappings setting, see https://grafana.com/docs/tempo/latest/configuration/#metrics-generator

I think this should work:

metrics_generator:
  processor:
    span_metrics:
      dimension_mappings:
        - name: http_request_method
          source_labels:
          - http.method
          join: ''
        - name: http_response_status_code
          source_labels:
          - http.status_code
          join: ''

Note that source_labels is the attribute name on the span, while name is the Prometheus label. This can be a bit confusing.