open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[prometheus exporter] Metric value overflow on metric_expiration #6935

Closed · albertteoh closed this issue 2 years ago

albertteoh commented 2 years ago

Describe the bug

When a metric expires, the metric's value appears to overflow, jumping to 9223372036854775808 (2^63, one more than the maximum int64 value).

Steps to reproduce

  1. Start the OTel Collector with the config provided below.
  2. Start HotROD example: docker run --rm --network="host" --env JAEGER_AGENT_HOST=localhost --env JAEGER_AGENT_PORT=6835 -p8080-8083:8080-8083 jaegertracing/example-hotrod:latest all
    1. This will emit spans to the OTel Collector on port 6835.
  3. Click on a button on HotROD example to send a trace.
  4. Check that the latency_bucket metrics are correct for the first minute of data, and make a note of which bucket has a count > 0; for example, the bucket with label le = "250".
    1. Example query: latency_bucket{service_name = "driver", le="250"}

What did you expect to see?

After a minute (on metric expiry), the metric with le = "250" (example query latency_bucket{service_name = "driver", le="250"}) should no longer be queryable (i.e. it should return no value).

What did you see instead?

After a minute (on metric expiry), the metric with le = "250" (example query latency_bucket{service_name = "driver", le="250"}) will jump to 9223372036854775808.

What version did you use?

Version: main branch

What config did you use?

Config:

receivers:
  jaeger:
    protocols:
      thrift_binary:
        endpoint: "localhost:6836"
      thrift_compact:
        endpoint: "localhost:6835"
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: :12345
  otlp:
    protocols:
      grpc:
        endpoint: :4317
  prometheus:
    config:
      global:
        external_labels:
          p8s_logzio_name: spm-demo-otel
      scrape_configs: 
      - job_name: 'atm'
        scrape_interval: 15s
        static_configs:
        - targets: [ "0.0.0.0:8889" ]
exporters:
  jaeger:
    endpoint: "localhost:14250"
    tls:
      insecure: true
  prometheus:
    endpoint: "localhost:8889"
    metric_expiration: 1m
  logging:
    loglevel: debug
processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
    latency_histogram_buckets: [2ms, 6ms, 10ms, 100ms, 250ms, 500ms, 1000ms, 10000ms, 100000ms, 1000000ms]
    # Additional list of dimensions on top of:
    # - service.name
    # - operation
    # - span.kind
    # - status.code
    dimensions:
      # If the span is missing http.method, the processor will insert
      # the http.method dimension with value 'GET'.
      # For example, in the following scenario, http.method is not present in a span and so will be added as a dimension to the metric with value "GET":
      # - promexample_calls{http_method="GET",operation="/Address",service_name="shippingservice",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
      - name: http.method
        default: GET
      # If a default is not provided, the http.status_code dimension will be omitted
      # if the span does not contain http.status_code.
      # For example, consider a scenario with two spans, one span having http.status_code=200 and another missing http.status_code. Two metrics would result with this configuration, one with the http_status_code omitted and the other included:
      # - promexample_calls{http_status_code="200",operation="/Address",service_name="shippingservice",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
      # - promexample_calls{operation="/Address",service_name="shippingservice",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
      - name: http.status_code    
extensions:
  pprof:
    endpoint: :1777
  zpages:
    endpoint: :55679
  health_check:
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [spanmetrics,batch]
      exporters: [jaeger, logging]
    metrics/spanmetrics:
      # This receiver is just a dummy and never used.
      # Added to pass validation requiring at least one receiver in a pipeline.
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]
    metrics:
      receivers: [otlp,prometheus]
      exporters: [logging]      
  telemetry:
    logs:
      level: "debug"

Environment

OS: "Ubuntu 20.04"
Compiler (if manually compiled): "go 1.17.5"

Additional context

The following screenshots illustrate the problem for the default 5m metric_expiration configuration:

Before Expiry

(screenshot: before-expiry)

After Expiry

(screenshot: after-expiry)

The logs also reflect the above screenshots, showing the correct metrics initially and then suddenly jumping to a very large count.

jpkrohling commented 2 years ago

@gouthamve, is this something you can help with?

albertteoh commented 2 years ago

Thanks @jpkrohling. 🙏🏼

I have more evidence to suggest this is a bug in the prometheusremotewrite exporter (I've updated the issue description with this exporter name): I ran a local Prometheus server off the same pipeline, scraping the prometheus exporter's data, and it looks fine.

The following are screenshots of exactly the same data, with identical queries and timestamp ranges.

The first screenshot is from a locally running Grafana instance querying a local Prometheus server that scrapes data from my local OTel Collector's prometheus exporter.

The second screenshot is from an observability provider (M3 is the backing metrics store) that the prometheusremotewrite exporter is configured to send metrics to. The visualisation is consistent with the numbers I see in my local OTel Collector logs, which confirms the problem resides within the prometheusremotewrite exporter, whether it's a bug or a misconfiguration.

prometheus exporter

Looks fine; the data is null after 21:04:30.

(screenshot: Screen Shot 2021-12-28 at 9 13 19 pm)

prometheus remote write exporter

Logs show the count jumping to 9223372036854775808 as reported earlier:

Data point attributes:
     -> http_method: STRING(GET)
     -> http_status_code: STRING(200)
     -> operation: STRING(HTTP GET /route)
     -> p8s_logzio_name: STRING(spm-demo-otel)
     -> service_name: STRING(route)
     -> span_kind: STRING(SPAN_KIND_SERVER)
     -> status_code: STRING(STATUS_CODE_UNSET)
StartTimestamp: 2021-12-28 10:03:16.465 +0000 UTC
Timestamp: 2021-12-28 10:04:16.465 +0000 UTC
Count: 50
Sum: 2652.758000
ExplicitBounds #0: 2.000000
ExplicitBounds #1: 6.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 100.000000
ExplicitBounds #4: 250.000000
ExplicitBounds #5: 500.000000
ExplicitBounds #6: 1000.000000
ExplicitBounds #7: 10000.000000
ExplicitBounds #8: 100000.000000
ExplicitBounds #9: 1000000.000000
ExplicitBounds #10: 9223372036854.775391
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 50
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0
Buckets #11, Count: 0
...
2021-12-28T10:04:31.166Z    debug   prometheusexporter@v0.41.0/accumulator.go:246   metric expired: latency, deleting key: Histogram*spanmetricsprocessor*latency*http.method*GET*http.status_code*200*operation*HTTP GET /route*service.name*route*span.kind*SPAN_KIND_SERVER*status.code*STATUS_CODE_UNSET    {"kind": "exporter", "name": "prometheus"}
...
Data point attributes:
     -> http_method: STRING(GET)
     -> http_status_code: STRING(200)
     -> operation: STRING(HTTP GET /route)
     -> p8s_logzio_name: STRING(spm-demo-otel)
     -> service_name: STRING(route)
     -> span_kind: STRING(SPAN_KIND_SERVER)
     -> status_code: STRING(STATUS_CODE_UNSET)
StartTimestamp: 2021-12-28 10:04:31.465 +0000 UTC
Timestamp: 2021-12-28 10:04:31.465 +0000 UTC
Count: 9223372036854775808
Sum: NaN
ExplicitBounds #0: 2.000000
ExplicitBounds #1: 6.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 100.000000
ExplicitBounds #4: 250.000000
ExplicitBounds #5: 500.000000
ExplicitBounds #6: 1000.000000
ExplicitBounds #7: 10000.000000
ExplicitBounds #8: 100000.000000
ExplicitBounds #9: 1000000.000000
ExplicitBounds #10: 9223372036854.775391
Buckets #0, Count: 9223372036854775808
Buckets #1, Count: 9223372036854775808
Buckets #2, Count: 9223372036854775808
Buckets #3, Count: 9223372036854775808
Buckets #4, Count: 9223372036854775808
Buckets #5, Count: 9223372036854775808
Buckets #6, Count: 9223372036854775808
Buckets #7, Count: 9223372036854775808
Buckets #8, Count: 9223372036854775808
Buckets #9, Count: 9223372036854775808
Buckets #10, Count: 9223372036854775808
Buckets #11, Count: 9223372036854775808
...
(screenshot: Screen Shot 2021-12-28 at 9 13 52 pm)

Aneurysm9 commented 2 years ago

I'm confused. The issue title says "prometheusremotewrite", but the sample config uses the prometheus exporter. The prometheusremotewrite exporter does not have the referenced metric_expiration configuration option.

It looks like what is happening here is that the data flow is something like spanmetrics -> promexp -> promrecv -> log. The prometheus exporter is correctly expiring the metrics and stops emitting them. At that point the Prometheus scrape manager used by the prometheus receiver emits a signalling NaN value to indicate that the metric is stale. Because of a data model mismatch between Prometheus and OTLP, that value cannot be properly represented for bucket and distribution counts, since OTLP uses integers for those values.

There is a flag in the OTLP data model that is intended to convey the same information, but it is not currently set by the prometheus receiver. The prometheusremotewrite exporter has been updated to recognize this flag and emit the Prometheus signalling NaN. There is a PR to do the same for the prometheus exporter (or, more correctly, to delete metrics with that flag, which will cause the scraper retrieving those metrics to emit the signalling NaN as appropriate). We had intended to update the prometheus receiver to begin setting those flags once they were correctly handled in both exporters.
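As an illustration of the mismatch described above, here is a minimal Go sketch (assumed behaviour, not the collector's actual code) of why the count lands on exactly 9223372036854775808: Prometheus marks stale series with a specific "signalling" NaN sample value, and converting that float64 to the unsigned integer count OTLP requires is implementation-dependent in Go, coming out as 2^63 on amd64. Where exactly the collector performs such a conversion is an assumption here.

package main

import (
    "fmt"
    "math"
)

func main() {
    // Prometheus marks a stale series with a specific "signalling" NaN
    // sample value (bit pattern 0x7ff0000000000002).
    staleNaN := math.Float64frombits(0x7ff0000000000002)

    // OTLP histograms store the total count and per-bucket counts as
    // unsigned integers, so a float sample has to be converted somewhere.
    // Converting NaN to an integer is implementation-dependent in Go;
    // on amd64 it comes out as 0x8000000000000000, i.e. 2^63 (other
    // architectures may give a different result).
    count := uint64(staleNaN)

    fmt.Println(math.IsNaN(staleNaN)) // true
    fmt.Println(count)                // 9223372036854775808 on amd64
}

This matches the logged data point above: Count and every bucket count show 9223372036854775808, while Sum stays NaN because it is a float64 and can carry the staleness marker through unchanged.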

Aneurysm9 commented 2 years ago

@albertteoh can you try building a collector from this branch and let me know if the issue persists?

albertteoh commented 2 years ago

I'm confused. The issue title says "prometheusremotewrite", but the sample config uses the prometheus exporter. The prometheusremotewrite exporter does not have the referenced metric_expiration configuration option.

Ah, sorry, silly mistake on my part. I had originally titled it with the prometheus exporter, but I jumped to an incorrect conclusion in my comment, especially given that the logs are in the same export group as the prometheusremotewrite exporter. 🤦🏼‍♂️ I've reverted the title change for posterity.

can you try building a collector from this branch and let me know if the issue persists?

I tried your branch and it looks good 👍🏼

Thanks very much for the quick turnaround, @Aneurysm9!