open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Collector generates invalid exemplars output for prometheus #19780

Closed ImoutoChan closed 1 year ago

ImoutoChan commented 1 year ago

Component(s)

exporter/prometheus

What happened?

Description

I'm moving my issue from https://github.com/open-telemetry/opentelemetry-collector/issues/7392, since it seems this is a more suitable location.

I have a system with services, collector (latest), and prometheus (2.42.0) for metrics aggregation.

The collector is up and running and returns metrics on the /metrics endpoint. It worked for several hours and Prometheus was successfully receiving data, but at some point Prometheus started to display this error:

Url: url:9201/metrics State: down Error: expected next entry after timestamp, got "INVALID"

Steps to Reproduce

I can't reproduce it with a small sample, but here is my Prometheus scrape config:

- job_name: opentelemetry
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  enable_http2: true
  static_configs:
  - targets:
    - url:9201

If I request url:9201/metrics with an application/openmetrics-text Accept header, it returns metrics without any problem.
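
A minimal Go sketch of that check, assuming the placeholder target url:9201 from the scrape config above:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    // Fetch the exporter endpoint the same way the report describes,
    // but asking for the OpenMetrics text format instead of the classic one.
    req, err := http.NewRequest(http.MethodGet, "http://url:9201/metrics", nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Accept", "application/openmetrics-text;version=1.0.0")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Content-Type: %s\n%s", resp.Header.Get("Content-Type"), body)
}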

Expected Result

The collector should always produce valid output on the /metrics endpoint for Prometheus.

Actual Result

It produces output, but Prometheus is unable to parse it and returns 'expected next entry after timestamp, got "INVALID"'.

Collector version

otel/opentelemetry-collector:latest@sha256:69b1cde6062a8ad68fa025fb240b5ec3efbb29205a0538924683471439a73353

Environment information

Environment

I use the provided image.

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheus:
    endpoint: ":9201"
    send_timestamps: true
    metric_expiration: 180m
    enable_open_metrics: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]

Log output

No response

Additional context

I also found something new today. I took the Prometheus source code and ran its parser tests against the collector's output.

Most of my metrics look like this:

cqrs_request_duration_bucket{job="restaurantschedule",request_name="SetStoreSchedule",request_result="success",request_type="command",le="0.0"} 0 1.679104706576e+09
cqrs_request_duration_bucket{job="restaurantschedule",request_name="SetStoreSchedule",request_result="success",request_type="command",le="5.0"} 0 1.679104706576e+09
...
cqrs_request_duration_bucket{job="restaurantschedule",request_name="SetStoreSchedule",request_result="success",request_type="command",le="250.0"} 1 1.679104706576e+09 # {trace_id="75e7c010e8d9e96f992b7f5a9b48d9eb",span_id="9b16aec40d49f271"} 102.0 1.6790992353793504e+09
...
cqrs_request_duration_bucket{job="restaurantschedule",request_name="SetStoreSchedule",request_result="success",request_type="command",le="+Inf"} 1 1.679104706576e+09
cqrs_request_duration_sum{job="restaurantschedule",request_name="SetStoreSchedule",request_result="success",request_type="command"} 102.0 1.679104706576e+09
cqrs_request_duration_count{job="restaurantschedule",request_name="SetStoreSchedule",request_result="success",request_type="command"} 1 1.679104706576e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query",le="0.0"} 0 1.67910474833e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query",le="5.0"} 44 1.67910474833e+09 # {trace_id="206a8551004d4076427beff12415222c",span_id="8cb2474ea645a7ab"} 1.0 1.6791021231569526e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query",le="10.0"} 48 1.67910474833e+09 # {trace_id="046d303704f31733a27825e68bf4eaab",span_id="eccca9f8548f8618"} 9.0 1.679061599709177e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query",le="25.0"} 49 1.67910474833e+09 # {trace_id="3d9f689f12cc3111948550b97ca0558e",span_id="8e7a9d5a2f0cb1ad"} 21.0 1.6790481419261534e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query",le="50.0"} 49 1.67910474833e+09
...
cqrs_request_duration_bucket{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query",le="250.0"} 49 1.67910474833e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query",le="500.0"} 50 1.67910474833e+09 # {trace_id="be46d031f32ad59bc2276292e6d0f59d",span_id="16e1be3df53b9d54"} 341.0 1.6790481299209695e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query",le="750.0"} 50 1.67910474833e+09
...
cqrs_request_duration_bucket{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query",le="+Inf"} 50 1.67910474833e+09
cqrs_request_duration_sum{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query"} 462.0 1.67910474833e+09
cqrs_request_duration_count{job="restaurantsorting",request_name="GetRestaurantSortingByIdQuery",request_result="success",request_type="query"} 50 1.67910474833e+09

As you can see, some of them have exemplars with trace references that look perfectly normal.

But then I found a part like this:

cqrs_request_duration_bucket{job="restaurantsorting",request_name="RemoveObsoleteSpecialPlacementsCommand",request_result="success",request_type="command",le="0.0"} 0 1.67910474833e+09
...
cqrs_request_duration_bucket{job="restaurantsorting",request_name="RemoveObsoleteSpecialPlacementsCommand",request_result="success",request_type="command",le="2500.0"} 1 1.67910474833e+09 #  2052.0 1.6791012020585337e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="RemoveObsoleteSpecialPlacementsCommand",request_result="success",request_type="command",le="5000.0"} 1 1.67910474833e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="RemoveObsoleteSpecialPlacementsCommand",request_result="success",request_type="command",le="7500.0"} 1 1.67910474833e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="RemoveObsoleteSpecialPlacementsCommand",request_result="success",request_type="command",le="10000.0"} 1 1.67910474833e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="RemoveObsoleteSpecialPlacementsCommand",request_result="success",request_type="command",le="+Inf"} 1 1.67910474833e+09
cqrs_request_duration_sum{job="restaurantsorting",request_name="RemoveObsoleteSpecialPlacementsCommand",request_result="success",request_type="command"} 2052.0 1.67910474833e+09
cqrs_request_duration_count{job="restaurantsorting",request_name="RemoveObsoleteSpecialPlacementsCommand",request_result="success",request_type="command"} 1 1.67910474833e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="UpdateRestaurantRatings",request_result="success",request_type="command",le="0.0"} 0 1.67910474833e+09
...
cqrs_request_duration_bucket{job="restaurantsorting",request_name="UpdateRestaurantRatings",request_result="success",request_type="command",le="5000.0"} 1 1.67910474833e+09 #  3745.0 1.6790940037781396e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="UpdateRestaurantRatings",request_result="success",request_type="command",le="7500.0"} 1 1.67910474833e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="UpdateRestaurantRatings",request_result="success",request_type="command",le="10000.0"} 1 1.67910474833e+09
cqrs_request_duration_bucket{job="restaurantsorting",request_name="UpdateRestaurantRatings",request_result="success",request_type="command",le="+Inf"} 1 1.67910474833e+09
cqrs_request_duration_sum{job="restaurantsorting",request_name="UpdateRestaurantRatings",request_result="success",request_type="command"} 3745.0 1.67910474833e+09
cqrs_request_duration_count{job="restaurantsorting",request_name="UpdateRestaurantRatings",request_result="success",request_type="command"} 1 1.67910474833e+09

As you can see, some of the exemplars have empty label sets: there is no {} at all, only the numbers. I guess this is where Prometheus breaks.

Maybe one of my services reported exemplars in the wrong format, but the collector should simply ignore such exemplars and always produce valid output on /metrics for Prometheus.
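
For context, the OpenMetrics text format requires the exemplar label braces even when the label set is empty, so a suffix such as "#  2052.0 1.6791012020585337e+09" with no "{}" cannot be parsed. Below is a minimal, standalone Go sketch of the expected rendering; it is an illustration, not the exporter's actual code:

package main

import (
    "fmt"
    "sort"
    "strings"
)

// formatExemplar renders the exemplar suffix of one OpenMetrics sample line.
func formatExemplar(labels map[string]string, value, ts float64) string {
    keys := make([]string, 0, len(labels))
    for k := range labels {
        keys = append(keys, k)
    }
    sort.Strings(keys)

    pairs := make([]string, 0, len(keys))
    for _, k := range keys {
        pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
    }
    // Per the OpenMetrics spec the braces are required even when empty,
    // so an exemplar with no labels should still render as "# {} value ts".
    return fmt.Sprintf("# {%s} %g %g", strings.Join(pairs, ","), value, ts)
}

func main() {
    // A well-formed exemplar, matching the healthy lines quoted above.
    fmt.Println(formatExemplar(map[string]string{
        "trace_id": "75e7c010e8d9e96f992b7f5a9b48d9eb",
        "span_id":  "9b16aec40d49f271",
    }, 102.0, 1.6790992353793504e+09))

    // An exemplar with no labels is still valid and prints "# {} 2052 1.6791012020585337e+09".
    // The broken output in this issue instead reads "#  2052.0 1.6791012020585337e+09",
    // i.e. the empty braces are missing entirely, which Prometheus rejects.
    fmt.Println(formatExemplar(nil, 2052.0, 1.6791012020585337e+09))
}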

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

atoulme commented 1 year ago

Please close the original issue.

atoulme commented 1 year ago

What version of Prometheus are you using?

ImoutoChan commented 1 year ago

Version: 2.42.0 (commit 225c61122d88b01d1f0eaaee0e05b6f3e0567ac0)

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.

domenicquirl commented 1 year ago

I think I just encountered this when setting up telemetry infrastructure for a system with a Grafana instance. I had configured Grafana to provide instrumentation metrics via a Prometheus endpoint (which worked like a charm) and later also traces via OTLP.

For simplicity, I had started with a configuration where all telemetry went through the collector: for metrics there was both a Prometheus receiver that scraped services (alongside other receivers) and a Prometheus exporter as the connection point for the actual data storage. Enabling trace output from Grafana made it include exemplars in its metric output, which suddenly caused metrics to stop appearing altogether.

After some digging, I found that Prometheus complains with the error

ts=2023-08-15T08:37:07.054Z caller=scrape.go:1387 level=debug component="scrape manager" scrape_pool=otel target=http://otel-collector:8889/metrics msg="Append failed" err="expected next entry after timestamp, got \" #  \" (\"INVALID\") while parsing: \"grafana_http_request_duration_seconds_bucket{handler=\\\"/api/dashboards/home\\\",instance=\\\"grafana:3000\\\",job=\\\"grafana\\\",method=\\\"GET\\\",status_code=\\\"200\\\",le=\\\"0.005\\\"} 2 1.69208862132e+09 #  \""

This is the same error as in the issue description.

The original metrics returned by the /metrics endpoint of Grafana look like this:

grafana_http_request_duration_seconds_bucket{handler="/api/dashboards/home",method="GET",status_code="200",le="0.005"} 2 # {traceID="5185d22962f8ac4094cfd03f693f582f"} 0.00116207 1.692086241234072e+09

Note that the above data point contains a traceID attribute for the exemplar. After going through the Collector, the /metrics endpoint of the Prometheus exporter returns them like this (when requested with an Accept header that includes application/openmetrics-text):

grafana_http_request_duration_seconds_bucket{handler="/api/dashboards/home",http_scheme="http",instance="grafana:3000",job="grafana",method="GET",net_host_name="grafana",net_host_port="3000",service_instance_id="grafana:3000",service_name="grafana",status_code="200",le="0.005"} 2 1.692094186319e+09 #  0.00116207 1.692086241234e+09

Note how the traceID (and the entire attribute set) is gone from the exemplar output.

Application traces, including traces from Grafana, go through the same collector (in a different pipeline, of course), and I looked up the traceID from the original Grafana output. I thought the exemplar trace IDs might be filtered down to only traces that are actually exported and that something went wrong there, but the trace successfully made it to the tracing backend.

The metrics, including exemplars, are processed correctly when Prometheus scrapes the Grafana endpoint directly, so given the two outputs it definitely looks like the traceID information gets lost somewhere in the collector.
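
One possible explanation, offered purely as a hypothesis: the exporter may only write exemplar labels from the OTLP exemplar's trace/span ID fields, so a trace reference that arrives only as a label named traceID (as in the Grafana output) would never be promoted to those fields and is dropped on the way out, leaving the malformed "#  value timestamp" suffix. The Go sketch below illustrates that conversion path with made-up types; the collector's real logic lives in the prometheus receiver and exporter and may well differ:

package main

import "fmt"

// otlpExemplar is a simplified stand-in for an OTLP exemplar.
type otlpExemplar struct {
    TraceID            string
    SpanID             string
    FilteredAttributes map[string]string
    Value              float64
}

// fromScraped converts a scraped exemplar label set into an OTLP-style
// exemplar, promoting well-known label names to the dedicated ID fields.
func fromScraped(labels map[string]string, value float64) otlpExemplar {
    e := otlpExemplar{FilteredAttributes: map[string]string{}, Value: value}
    for k, v := range labels {
        switch k {
        // Assumption for illustration: only these exact spellings are promoted.
        case "trace_id":
            e.TraceID = v
        case "span_id":
            e.SpanID = v
        default:
            // Grafana's "traceID" label falls through here and is carried
            // along only as a filtered attribute...
            e.FilteredAttributes[k] = v
        }
    }
    return e
}

func main() {
    e := fromScraped(map[string]string{"traceID": "5185d22962f8ac4094cfd03f693f582f"}, 0.00116207)
    // ...so an exporter that writes exemplar labels only from TraceID/SpanID
    // (and drops filtered attributes) would emit an empty label set here,
    // matching the "#  0.00116207 ..." output quoted above.
    fmt.Printf("%+v\n", e)
}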