Status: Closed (albertteoh closed this issue 1 year ago).
@albertteoh Could you make this work? Or is there any workaround?
Hi @ankitnayan, my workaround was to filter out any latencies > 24 hours, which isn't nice but it does the job for my use case at least.
@albertteoh I have a similar issue, but using Prometheus.
spanMetricsProcessor is creating a bucket with le="9.223372036854775e+12":
latency_bucket{http_status_code="200",operation="/health/xxxxx/health/**",service_name="xxxxx",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="9.223372036854775e+12"} 1
I guess the spanMetricsProcessor code needs to handle Go number conversion when using float64.
Please have a look at these lines https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/spanmetricsprocessor/processor.go#L140
I've simulated it here and the behavior was the same.
So, when Prometheus reads that "number", it behaves this way. I don't know if this doc can help.
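For what it's worth, here is a minimal Go sketch of where that bound could come from (my own reconstruction, not the processor's actual code), assuming the top histogram bound is derived from the maximum int64 nanosecond duration converted to milliseconds as a float64:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// Assumption: the top bucket bound is the maximum representable duration
	// (math.MaxInt64 nanoseconds) converted to milliseconds as a float64.
	maxDuration := time.Duration(math.MaxInt64)
	boundMs := float64(maxDuration.Nanoseconds()) / float64(time.Millisecond.Nanoseconds())
	fmt.Printf("le=%g\n", boundMs) // le=9.223372036854775e+12, matching the bucket above
}
```

The float64 rounding of math.MaxInt64 is presumably what makes the value look truncated.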
@luistilingue I upgraded to v0.43.0 and it seems to have been fixed. Which version are you using?
@ankitnayan I was using 0.40.0 and I upgraded all my stack (otel collector to 0.46.0, prometheus to 2.33.5, and javaagent to 1.11.1), but the issue still occurs.
Hello all, I believe this is caused by a bug we found in Prometheus that causes the +Inf bucket to be added incorrectly, which in turn results in a negative number when converting cumulative datapoints to delta: https://github.com/prometheus/client_golang/issues/1147
We faced an issue whereby New Relic dropped our datapoints because of this. The issue existed with 0.61.0 but became much worse with 0.62.0. We built a custom image updating github.com/prometheus/client_golang to the SHA version (dcea97eee2b3257f34fd3203cb922eedeabb42a6) that contained our fix, and the issue disappeared.
cc @TylerHelmuth
Actually, looking at it, it is not the same bug, but the same issue: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/translator/prometheusremotewrite/helper.go#L350
The +Inf bucket should have the same value as the total count:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/translator/prometheusremotewrite/helper.go#L306
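To illustrate that invariant, here is a small self-contained sketch (my own illustration, not the actual translator code): the explicit bucket counts become cumulative "le" series, and the implicit +Inf series must carry the total count; if the two disagree, the histogram is no longer monotonic and a later cumulative-to-delta conversion can go negative.

```go
package main

import "fmt"

// toCumulative mimics, in spirit only, how explicit bucket counts become
// Prometheus-style cumulative "le" series, with the implicit +Inf series
// taken from the data point's total count.
func toCumulative(bucketCounts []uint64, totalCount uint64) []uint64 {
	out := make([]uint64, 0, len(bucketCounts)+1)
	var cumulative uint64
	for _, c := range bucketCounts {
		cumulative += c
		out = append(out, cumulative)
	}
	// +Inf must equal the total count; anything else breaks downstream
	// cumulative-to-delta conversion.
	return append(out, totalCount)
}

func main() {
	// Counts taken from the log excerpt in this issue: the explicit buckets
	// sum to 3 but the reported total count is only 1.
	fmt.Println(toCumulative([]uint64{2, 1}, 1)) // [2 3 1]: non-monotonic, hence invalid
}
```

With consistent data the +Inf value would be at least as large as the last finite bucket's cumulative count.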
Pinging code owners: @Aneurysm9. See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Describe the bug
When using prometheusremotewrite to export metrics to M3, I'm getting latencies that are over 200 years when queried from M3.
However, when scraping these metrics from prometheus, the latencies look correct.
Am I configuring something incorrectly?
Steps to reproduce
What did you expect to see?
Identical 95th percentile latencies, or at least close enough to one another.
What did you see instead?
Latencies from M3 were over 200 years, whereas from Prometheus, they were a more sensible ~200ms.
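As a back-of-the-envelope sanity check (my own arithmetic, assuming the reported p95 simply lands in the overflowed top bucket with le="9.223372036854775e+12" ms mentioned in the comments above), that bound works out to roughly 292 years, which is consistent with the "over 200 years" observation:

```go
package main

import "fmt"

func main() {
	// Convert the suspicious top bucket bound (milliseconds) into years.
	const boundMs = 9.223372036854775e+12
	years := boundMs / 1000 / 60 / 60 / 24 / 365.25
	fmt.Printf("~%.0f years\n", years) // ~292 years
}
```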
Here are two screenshots of the same query executed against the Prometheus and M3 data sources respectively:
[Prometheus screenshot]
[M3 screenshot]
To reduce the search space by ruling out M3 and spanmetrics processor as possible causes, I also checked the logs (these are from an earlier run):
Here, I log the total latency_count as well as the latency_bucket counts within the spanmetrics processor. I've taken logs from two different times, 10 seconds apart, and as you can see, the count is consistent with the sum of bucket_counts:
However, this is the log output from the last metrics pipeline in the config below, i.e.:
As you can see, the total count is 1 but the bucket count total is 2 + 1 = 3, and so I believe the +Inf bucket tries to account for this discrepancy, resulting in -2, represented as the uint64 equivalent, 18446744073709551614. I have also seen versions in the logs where the total count > sum of bucket counts, leading to a "positive" spillover +Inf count.
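To make that arithmetic concrete, here is a tiny illustration of the wraparound (a sketch of the arithmetic described above, not the collector's actual code):

```go
package main

import "fmt"

func main() {
	// Total count and the sum of the explicit bucket counts from the log output above.
	var totalCount uint64 = 1
	var bucketSum uint64 = 2 + 1

	// The implied +Inf delta goes negative and wraps around in uint64 arithmetic.
	fmt.Println(totalCount - bucketSum) // 18446744073709551614, the uint64 representation of -2
}
```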
What version did you use?
Version: opentelemetry-collector-contrib@master
What config did you use?
Config: (e.g. the yaml config file)
Environment
OS: MacOS
Compiler (if manually compiled): go 1.16
Additional context
cc @bogdandrutu