Closed: albertteoh closed this issue 2 years ago
@gouthamve, is this something you can help with?
Thanks @jpkrohling. 🙏🏼
I have more evidence to suggest this is a bug in the prometheusremotewrite exporter (I've updated the issue description with this exporter name), because I ran a local Prometheus server off the same pipeline, scraping the prometheus exporter's data, and it looks fine.
The following are screenshots of exactly the same data, with identical queries and timestamp ranges.
The first screenshot is from a locally running Grafana querying a local Prometheus server that scrapes data from my local OTEL collector's prometheusexporter.
The second screenshot is from an observability provider (M3 is the backing metrics store) to which the prometheusremotewrite exporter is configured to send metrics. Its visualisation is consistent with the numbers I see in my local otel collector logs, which confirms the problem resides within the prometheusremotewrite exporter, whether it's a bug or a misconfiguration.
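For context, here is a minimal sketch (not the actual config from this issue; the endpoints and the receiver are assumptions) of a metrics pipeline fanning out to both exporters, so the same data can be compared via a locally scraped Prometheus and via remote write:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheus:                  # scraped by the local Prometheus server
    endpoint: "0.0.0.0:8889"
  prometheusremotewrite:       # pushes to the remote backend (M3 in this case); hypothetical URL
    endpoint: "https://example.com/api/v1/prom/remote/write"

service:
  pipelines:
    metrics:
      receivers: [otlp]        # in the real setup the metrics come from the spanmetrics processor
      exporters: [prometheus, prometheusremotewrite]
```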
prometheus exporter
Looks fine; the data is null after 21:04:30.
prometheus remote write exporter
Logs show the number jumping to the MaxInt64 value as reported earlier:
Data point attributes:
-> http_method: STRING(GET)
-> http_status_code: STRING(200)
-> operation: STRING(HTTP GET /route)
-> p8s_logzio_name: STRING(spm-demo-otel)
-> service_name: STRING(route)
-> span_kind: STRING(SPAN_KIND_SERVER)
-> status_code: STRING(STATUS_CODE_UNSET)
StartTimestamp: 2021-12-28 10:03:16.465 +0000 UTC
Timestamp: 2021-12-28 10:04:16.465 +0000 UTC
Count: 50
Sum: 2652.758000
ExplicitBounds #0: 2.000000
ExplicitBounds #1: 6.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 100.000000
ExplicitBounds #4: 250.000000
ExplicitBounds #5: 500.000000
ExplicitBounds #6: 1000.000000
ExplicitBounds #7: 10000.000000
ExplicitBounds #8: 100000.000000
ExplicitBounds #9: 1000000.000000
ExplicitBounds #10: 9223372036854.775391
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 50
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0
Buckets #11, Count: 0
...
2021-12-28T10:04:31.166Z debug prometheusexporter@v0.41.0/accumulator.go:246 metric expired: latency, deleting key: Histogram*spanmetricsprocessor*latency*http.method*GET*http.status_code*200*operation*HTTP GET /route*service.name*route*span.kind*SPAN_KIND_SERVER*status.code*STATUS_CODE_UNSET {"kind": "exporter", "name": "prometheus"}
...
Data point attributes:
-> http_method: STRING(GET)
-> http_status_code: STRING(200)
-> operation: STRING(HTTP GET /route)
-> p8s_logzio_name: STRING(spm-demo-otel)
-> service_name: STRING(route)
-> span_kind: STRING(SPAN_KIND_SERVER)
-> status_code: STRING(STATUS_CODE_UNSET)
StartTimestamp: 2021-12-28 10:04:31.465 +0000 UTC
Timestamp: 2021-12-28 10:04:31.465 +0000 UTC
Count: 9223372036854775808
Sum: NaN
ExplicitBounds #0: 2.000000
ExplicitBounds #1: 6.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 100.000000
ExplicitBounds #4: 250.000000
ExplicitBounds #5: 500.000000
ExplicitBounds #6: 1000.000000
ExplicitBounds #7: 10000.000000
ExplicitBounds #8: 100000.000000
ExplicitBounds #9: 1000000.000000
ExplicitBounds #10: 9223372036854.775391
Buckets #0, Count: 9223372036854775808
Buckets #1, Count: 9223372036854775808
Buckets #2, Count: 9223372036854775808
Buckets #3, Count: 9223372036854775808
Buckets #4, Count: 9223372036854775808
Buckets #5, Count: 9223372036854775808
Buckets #6, Count: 9223372036854775808
Buckets #7, Count: 9223372036854775808
Buckets #8, Count: 9223372036854775808
Buckets #9, Count: 9223372036854775808
Buckets #10, Count: 9223372036854775808
Buckets #11, Count: 9223372036854775808
...
I'm confused. The issue title says "prometheusremotewrite", but the sample config uses the prometheus exporter. The prometheusremotewrite exporter does not have the referenced metric_expiration configuration option.
It looks like what is happening here is that the data flow is something like spanmetrics -> promexp -> promrecv -> log. The prometheus exporter is correctly expiring the metrics and stops emitting them. At that point the Prometheus scrape manager used by the prometheus receiver emits a signaling NaN value to indicate that the metric is stale. Because of a data model mismatch between Prometheus and OTLP, that value cannot be properly represented for bucket and distribution counts, since OTLP uses integers for those values. There is a flag in the OTLP data model that is intended to convey the same information, but it is not currently set by the prometheus receiver. The prometheusremotewrite exporter has been updated to recognize this flag and emit the Prometheus signaling NaN. There is a PR to do the same for the prometheus exporter (or, more correctly, to delete metrics with that flag, which will cause the scraper retrieving those metrics to emit the signaling NaN as appropriate). We had intended to update the prometheus receiver to begin setting those flags once they were correctly handled in both exporters.
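For illustration only (not from the original thread), a minimal Go sketch of the mismatch described above: Prometheus marks a stale series by writing a special NaN bit pattern as the sample value, and because OTLP histogram and bucket counts are unsigned integers, a naive float-to-integer conversion of that NaN cannot preserve it. On amd64 it comes out as 2^63 = 9223372036854775808, the count seen in the logs. Only the standard library is used here, and the conversion result is platform-dependent.

```go
package main

import (
	"fmt"
	"math"
)

// Prometheus staleness markers are a specific NaN bit pattern written as the
// sample value when a series goes stale (see prometheus/pkg/value.StaleNaN).
const staleNaNBits uint64 = 0x7ff0000000000002

func main() {
	stale := math.Float64frombits(staleNaNBits)
	fmt.Println(math.IsNaN(stale)) // true: a staleness marker is still a NaN

	// OTLP histogram Count and bucket counts are unsigned integers, so the
	// NaN cannot be represented. Converting NaN to an integer in Go is
	// implementation-dependent; on amd64 it yields the "integer indefinite"
	// value MinInt64, whose unsigned representation is 2^63.
	asInt := int64(stale)
	fmt.Println(asInt)         // -9223372036854775808 on amd64
	fmt.Println(uint64(asInt)) // 9223372036854775808, the count seen in the logs
}
```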
@albertteoh can you try building a collector from this branch and let me know if the issue persists?
I'm confused. The issue title says "prometheusremotewrite", but the sample config uses the prometheus exporter. The prometheusremotewrite exporter does not have the referenced metric_expiration configuration option.
Ah, sorry, silly mistake on my part. I had originally titled it with the prometheus exporter, but I jumped to an incorrect conclusion in my comment, especially given the logs are in the same export group as the prometheusremotewrite exporter. 🤦🏼♂️ I've reverted the title change for posterity.
can you try building a collector from this branch and let me know if the issue persists?
I tried your branch and it looks good 👍🏼
Thanks very much for the quick turnaround, @Aneurysm9!
Describe the bug
When a metric expires, the value for the metric appears to overflow the max int64 value: 9223372036854775808.
Steps to reproduce
Run the HotROD example application, which sends spans to localhost on port 6835:
docker run --rm --network="host" --env JAEGER_AGENT_HOST=localhost --env JAEGER_AGENT_PORT=6835 -p8080-8083:8080-8083 jaegertracing/example-hotrod:latest all
Confirm the latency_bucket metrics are correct for the first minute of data, and make a note of which bucket has count > 0. For example, where the metric has label le = "250": latency_bucket{service_name = "driver", le="250"}
What did you expect to see?
After a minute (on metric expiry), the metric with le = "250" (example query: latency_bucket{service_name = "driver", le="250"}) should no longer be query-able (a null value).
What did you see instead?
After a minute (on metric expiry), the metric with le = "250" (example query: latency_bucket{service_name = "driver", le="250"}) jumps to 9223372036854775808.
What version did you use?
Version: main branch
What config did you use?
Config:
Environment
OS: "Ubuntu 20.04"
Compiler (if manually compiled): "go 1.17.5"
Additional context
Is there something the spanmetrics processor should be doing to avoid this problem? Or is it related to the prometheus exporter's metric_expiration duration?
The following screenshots illustrate the problem for the default 5m metric_expiration configuration:
Before Expiry
After Expiry
The logs also reflect the above screenshots, showing the correct metrics initially, then suddenly jumping to a very large count:
Before expiry: the le = "250" bucket has a count of 3.
After expiry: the le = "250" bucket jumps to a very large number, which appears to be 1 + MaxInt64 (see the quick check below).
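As a quick sanity check (mine, not from the original issue), the large value is indeed 1 + MaxInt64, i.e. 2^63:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// The count observed after expiry, compared against the two expressions below.
	const observed uint64 = 9223372036854775808

	fmt.Println(observed == uint64(math.MaxInt64)+1) // true: 1 + MaxInt64
	fmt.Println(observed == uint64(1)<<63)           // true: 2^63
}
```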