msvechla opened 1 month ago
Hi, agree this shouldn't be happening. We add a unique label __metrics_gen_instance to the series generated by each metrics-generator to ensure the series are unique between them. Since you are running on Kubernetes, the hostnames for each pod should be unique.
Can you check some things?
- Does this only happen to traces_spanmetrics_latency_bucket or to all metrics types? (so also counters)
- Do the log lines have mostly the same attributes or are they very distinct? As in, is it just one series being repeated over and over, or do you see a mix of series?
- Do you happen to have out-of-order sample ingestion enabled in Mimir? We shouldn't need it, but it might be interesting to see if this changes things.
Thanks!
Our current theory is that this might be due to how we are injecting a 0 sample when adding new histogram series. Whenever we remote write a series for the first time, we inject a 0 sample right before appending the actual value to clearly mark it as going from 0 to the new value:
This code adds a sample with value 0, 1ms before the normal sample.
We have similar code in the counter implementation, but instead of injecting a sample 1ms earlier, we delay the next samples by 1s:
So if you only see histograms causing the duplicate sample errors, that is a clear indicator something in that implementation is not right.
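(Purely for illustration, not Tempo's actual code: a sketch of the two samples that would be appended for a brand-new histogram bucket series under the scheme described above; the timestamps and value below are made up.)
# hypothetical appends for one new bucket series (timestamps in ms)
- timestamp_ms: 1699999999999   # injected marker sample, 1ms before the real one
  value: 0
- timestamp_ms: 1700000000000   # first real sample for this series
  value: 3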
Hi and thanks for your detailed analysis!
Does this only happen to traces_spanmetrics_latency_bucket or to all metrics types? (so also counters)
- It appears to be happening with all kinds of series; however, the largest number of errors is coming from the histogram metrics
Do the log lines have mostly the same attributes or are they very distinct? As in, is it just one series being repeated over and over, or do you see a mix of series?
- Most errors have very similar attributes; all of them have a very generic span_name like POST or GET
Do you happen to have out-of-order sample ingestion enabled in Mimir? We shouldn't need it, but it might be interesting to see if this changes things.
- I will give this a try and report back
@kvrhdn I also tried enabling out-of-order sample ingestion, but this had no effect at all. We are still getting the same amount of err-mimir-sample-duplicate-timestamp errors.
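(For reference, a minimal sketch of what enabling out-of-order ingestion looks like on the Mimir side, assuming the global limits block is used; the 30m window is an arbitrary example value.)
limits:
  # accept samples up to 30 minutes older than the latest sample in each series
  out_of_order_time_window: 30m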
Do you maybe have any relabel configs on your remote write? Maybe you are dropping labels that would make series unique.
Or do you maybe have multiple sources of these metrics? (e.g. you are sending from both Tempo and Alloy)
Our current theory is that this might be due to how we are injecting a 0 sample when adding new histogram series.
We did some more investigation into this and it shouldn't be the issue. The errors we were seeing somewhere else were related to aggregation.
Do you maybe have any relabel configs on your remote write? Maybe you are dropping labels that would make series unique.
The only remote-write config we have is the one I posted in our config above. To my knowledge this should just rename labels and not cause any loss of uniqueness:
storage:
  path: /var/tempo/wal
  remote_write:
    - send_exemplars: true
      url: ...
      write_relabel_configs:
        - regex: ^(.+)$
          source_labels:
            - http_method
          target_label: http_request_method
        - regex: ^(.+)$
          source_labels:
            - http_status_code
          target_label: http_response_status_code
        - action: labeldrop
          regex: ^http_method|http_status_code$
@kvrhdn it looks like removing the above-mentioned write_relabel_configs fixes the issue.
Any idea what is causing issues with these relabel configs? As I mentioned, I just want to rename http_method to http_request_method and http_status_code to http_response_status_code.
The ingested metrics look exactly like we would expect:
With relabel config:
traces_spanmetrics_calls_total{__metrics_gen_instance="tempo-metrics-generator-85ccf9bcdc-lw8sh",http_request_method="GET",http_response_status_code="200",platform_aws_organizational_unit="fleetops",platform_aws_stage="prod",service="fmm-graphql-gateway",span_kind="SPAN_KIND_CLIENT",span_name="GET",status_code="STATUS_CODE_UNSET"}
Without relabel config:
traces_spanmetrics_calls_total{__metrics_gen_instance="tempo-metrics-generator-85ccf9bcdc-lw8sh",http_method="GET",http_status_code="200",platform_aws_organizational_unit="fleetops",platform_aws_stage="prod",service="fmm-graphql-gateway",span_kind="SPAN_KIND_CLIENT",span_name="GET",status_code="STATUS_CODE_UNSET"}
Nice find! Yeah, I'm not sure why this relabel config is causing issues (but I'm also not an expert on this). Maybe ^(.+)$ is too restrictive? You could try (.*) instead (the default) and see if that changes things.
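(If you want to try that, the relevant part of the config posted above would look something like this; only the regex values are changed, everything else is kept as-is.)
write_relabel_configs:
  - regex: (.*)
    source_labels:
      - http_method
    target_label: http_request_method
  - regex: (.*)
    source_labels:
      - http_status_code
    target_label: http_response_status_code
  - action: labeldrop
    regex: ^http_method|http_status_code$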
The metrics-generator also has built-in support to remap dimensions: the dimension_mappings setting.
https://grafana.com/docs/tempo/latest/configuration/#metrics-generator
I think this should work:
metrics_generator:
  processor:
    span_metrics:
      dimension_mappings:
        - name: http_request_method
          source_labels: http.method
          join: ''
        - name: http_response_status_code
          source_labels: http.status_code
          join: ''
Note that source_labels is the attribute name on the span and name is the Prometheus label. This can be a bit confusing.
Describe the bug
Hello,
We are using the Tempo metrics-generator to generate span metrics from traces.
In general this works; however, our metrics-generator is throwing lots of err-mimir-sample-duplicate-timestamp errors in the logs. The error is thrown on average about 250 times per minute:
Some sample log lines:
In our infrastructure this seems to be mostly coming from metrics generated from auto-instrumented Node.js services; however, this might be the case for other services as well.
Expected behavior
err-mimir-sample-duplicate-timestamp should not be thrown regularly
Environment:
Additional Context
metrics-generator config:
Thanks a lot for your help!