Closed: ImoutoChan closed this issue 1 year ago.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Please close the original issue.
What version of Prometheus are you using?
Version: 2.42.0 (revision 225c61122d88b01d1f0eaaee0e05b6f3e0567ac0)
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
I think I just encountered this when setting up telemetry infrastructure for a system with a Grafana instance. I had configured Grafana to provide instrumentation metrics via a Prometheus endpoint (which worked like a charm) and later also traces via OTLP.
For simplicity, I had started with a configuration where all telemetry went through the collector: for metrics there was both a Prometheus receiver that scraped the services (alongside other receivers) and a Prometheus exporter as the connection point for the actual data storage (a rough sketch of that pipeline is below). Enabling trace output from Grafana made it include exemplars in its metric output, which suddenly caused no metrics to appear at all any more.
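A minimal sketch of what I mean, not my actual config; job names, ports, and the tracing backend are placeholders, and I am assuming enable_open_metrics is what turns on exemplar output in the exporter:

```yaml
receivers:
  prometheus:            # scrapes Grafana (and other services) for metrics
    config:
      scrape_configs:
        - job_name: grafana
          static_configs:
            - targets: ["grafana:3000"]
  otlp:                  # receives traces from the instrumented services
    protocols:
      grpc:

exporters:
  prometheus:            # Prometheus scrapes this endpoint for the aggregated metrics
    endpoint: "0.0.0.0:8889"
    enable_open_metrics: true   # assumption: required so exemplars show up in the scrape output
  otlp:
    endpoint: "tempo:4317"      # placeholder tracing backend

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      exporters: [otlp]
```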
After some digging, I found that Prometheus complains with the following error:
ts=2023-08-15T08:37:07.054Z caller=scrape.go:1387 level=debug component="scrape manager" scrape_pool=otel target=http://otel-collector:8889/metrics msg="Append failed" err="expected next entry after timestamp, got \" # \" (\"INVALID\") while parsing: \"grafana_http_request_duration_seconds_bucket{handler=\\\"/api/dashboards/home\\\",instance=\\\"grafana:3000\\\",job=\\\"grafana\\\",method=\\\"GET\\\",status_code=\\\"200\\\",le=\\\"0.005\\\"} 2 1.69208862132e+09 # \""
This is the same error as in the issue description.
The original metrics returned by the /metrics endpoint of Grafana look like this:
grafana_http_request_duration_seconds_bucket{handler="/api/dashboards/home",method="GET",status_code="200",le="0.005"} 2 # {traceID="5185d22962f8ac4094cfd03f693f582f"} 0.00116207 1.692086241234072e+09
Note that the above data point contains a traceID attribute for the exemplar.
After going through the collector, the /metrics endpoint of the Prometheus exporter returns them like this (when requested with an Accept header that includes application/openmetrics-text):
grafana_http_request_duration_seconds_bucket{handler="/api/dashboards/home",http_scheme="http",instance="grafana:3000",job="grafana",method="GET",net_host_name="grafana",net_host_port="3000",service_instance_id="grafana:3000",service_name="grafana",status_code="200",le="0.005"} 2 1.692094186319e+09 # 0.00116207 1.692086241234e+09
Note how the traceID (and the entire attribute set) is gone from the exemplar output.
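For comparison, this is roughly what I would have expected the exporter to emit if it kept the exemplar labels (label set copied from the exporter output above, exemplar taken from the original Grafana line; my reading of the OpenMetrics exposition format is that the braced label set after the # is required even when it is empty):

```
grafana_http_request_duration_seconds_bucket{handler="/api/dashboards/home",http_scheme="http",instance="grafana:3000",job="grafana",method="GET",net_host_name="grafana",net_host_port="3000",service_instance_id="grafana:3000",service_name="grafana",status_code="200",le="0.005"} 2 1.692094186319e+09 # {traceID="5185d22962f8ac4094cfd03f693f582f"} 0.00116207 1.692086241234e+09
```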
Application traces, including traces from Grafana, go through the same collector (but a different pipeline, of course), and I looked up the traceID from the original Grafana output. I thought maybe the trace IDs were filtered down to only traces that are actually exported and that there was an issue there, but the trace successfully made it to the tracing backend.
The metrics, including exemplars, are processed correctly when Prometheus scrapes the Grafana endpoint directly, so given the two outputs it definitely looks like the traceID info gets lost somewhere in the collector.
Component(s)
exporter/prometheus
What happened?
Description
I'm moving my issue from https://github.com/open-telemetry/opentelemetry-collector/issues/7392, since it seems this is a more suitable location.
I have a system with several services, the collector (latest), and Prometheus (2.42.0) for metrics aggregation.
The collector is up and running and returns metrics on its /metrics endpoint. It worked for several hours and Prometheus was successfully receiving data, but at some point Prometheus started to display this error:
Url: url:9201/metrics
State: down
Error: expected next entry after timestamp, got "INVALID"
Steps to Reproduce
I can't reproduce it with a small sample, but here is my config for Prometheus:
If I request url:9201/metrics with an application/openmetrics-text Accept header, it returns the metrics without any problem.
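For comparison, a quick sketch (not part of my setup) of how both responses can be checked side by side; url:9201 is the placeholder host from above, and the program only prints the negotiated content type plus the lines that carry exemplars:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// fetch requests the metrics endpoint with the given Accept header and prints
// the negotiated Content-Type plus every line that contains an exemplar ("# ").
func fetch(url, accept string) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	if accept != "" {
		req.Header.Set("Accept", accept)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	fmt.Printf("Accept=%q -> Content-Type=%q\n", accept, resp.Header.Get("Content-Type"))
	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // metric lines can be long
	for sc.Scan() {
		if strings.Contains(sc.Text(), " # ") {
			fmt.Println(sc.Text())
		}
	}
	return sc.Err()
}

func main() {
	const target = "http://url:9201/metrics" // placeholder host from the report above
	// Plain request, the way Prometheus scrapes when OpenMetrics is not negotiated.
	_ = fetch(target, "")
	// Explicit OpenMetrics request, the only format where exemplars are expected.
	_ = fetch(target, "application/openmetrics-text")
}
```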
Expected Result
The collector should always produce valid output on the /metrics endpoint for Prometheus.
Actual Result
It produces something, but Prometheus is unable to parse it and reports 'expected next entry after timestamp, got "INVALID"'.
Collector version
otel/opentelemetry-collector:latest@sha256:69b1cde6062a8ad68fa025fb240b5ec3efbb29205a0538924683471439a73353
Environment information
Environment
I use the provided image.
OpenTelemetry Collector configuration
Log output
No response
Additional context
I also found something new today. I got the source code from Prometheus and ran their parser tests on the collector output.
Most of my metrics look like this:
As you can see, some of them have exemplars with trace IDs that look pretty normal.
But then I found a part like this:
As you can see, some of the exemplars have empty labels, without the {}, only the numbers. I guess this is where Prometheus breaks.
Maybe one of my services reported exemplars in the wrong format, but either way the collector should just ignore such exemplars and always produce correct output on /metrics for Prometheus.
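For anyone who wants to repeat that kind of check without wiring up the actual Prometheus parser tests, here is a rough equivalent using the text parser from prometheus/common/expfmt (a sketch only; it assumes the scrape output has been saved to scrape.txt, and it parses the classic text format, so the exact error text will differ from what Prometheus logs while scraping):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// scrape.txt is a saved copy of the collector's /metrics response.
	f, err := os.Open("scrape.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// TextToMetricFamilies parses the classic Prometheus text format; a line
	// that format cannot represent (such as an exemplar suffix) should
	// surface here as a parse error.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(f)
	if err != nil {
		log.Fatalf("output does not parse: %v", err)
	}
	fmt.Printf("parsed %d metric families without errors\n", len(families))
}
```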