open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Inconsistent results between prometheusexporter and prometheusremotewrite #4975

Closed: albertteoh closed this issue 1 year ago

albertteoh commented 3 years ago

Describe the bug When using prometheusremotewrite to export metrics to M3, I'm getting latencies that are over 200 years when queried from M3.

However, when scraping these metrics from prometheus, the latencies look correct.

Am I configuring something incorrectly?

Steps to reproduce

  1. Run opentelemetry-collector-contrib with the config attached in this issue.
  2. Send spans to the collector using omnition/synthetic-load-generator, with the spanmetrics processor sending metrics to a prometheus exporter.
  3. Have a Prometheus server running that scrapes metrics from the prometheus exporter.
  4. Within the OpenTelemetry Collector, a prometheus receiver also scrapes metrics from the same prometheus exporter above and sends them to the configured prometheusremotewrite exporter, which writes to an M3 instance.
  5. Set up Grafana to query both data sources (the Prometheus server and M3) and graph both timeseries with the same query.

What did you expect to see? Identical 95th percentile latencies, or at least close enough to one another.

What did you see instead? Latencies from M3 were over 200 years, whereas from Prometheus, they were a more sensible ~200ms.

Here are two screenshots of the same query executed against Prometheus and M3 data sources respectively:

Prometheus (screenshot omitted)

M3 (screenshot omitted)

To reduce the search space by ruling out M3 and spanmetrics processor as possible causes, I also checked the logs (these are from an earlier run):

Here, I log the total latency_count as well as the latency_bucket counts within the spanmetrics processor. I've taken logs from two different times, 10 seconds apart, and as you can see, the count is consistent with the sum of bucket_counts:

2021-05-08T18:54:20.947+1000    debug   spanmetricsprocessor@v0.0.0-00010101000000-000000000000/processor.go:258        Latency metrics {"kind": "processor", "name": "spanmetrics", "key": "frontend\u0000/checkout\u0000SPAN_KIND_CLIENT\u0000STATUS_CODE_UNSET", "count": 2, "bucket_count": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]}

2021-05-08T18:54:30.958+1000    debug   spanmetricsprocessor@v0.0.0-00010101000000-000000000000/processor.go:258        Latency metrics {"kind": "processor", "name": "spanmetrics", "key": "frontend\u0000/checkout\u0000SPAN_KIND_CLIENT\u0000STATUS_CODE_UNSET", "count": 3, "bucket_count": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0]}

However, this is the log output from the last metrics pipeline in the config below, i.e.:

    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite, logging]

As you can see, the total count is 1 but the bucket counts sum to 2 + 1 = 3, so I believe the +Inf bucket tries to account for this discrepancy, resulting in -2 represented as its uint64 equivalent, 18446744073709551614. I have also seen logs where the total count is greater than the sum of bucket counts, leading to a "positive" spillover in the +Inf count. (A small sketch of this arithmetic follows the log output below.)

HistogramDataPoints #12
Data point labels:
     -> operation: /checkout
     -> service_name: frontend
     -> span_kind: SPAN_KIND_CLIENT
     -> status_code: STATUS_CODE_UNSET
StartTimestamp: 2021-05-08 08:54:22.433 +0000 UTC
Timestamp: 2021-05-08 08:54:32.437 +0000 UTC
Count: 1
Sum: 1708.000000
ExplicitBounds #0: 2.000000
ExplicitBounds #1: 6.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 100.000000
ExplicitBounds #4: 250.000000
ExplicitBounds #5: 300.000000
ExplicitBounds #6: 400.000000
ExplicitBounds #7: 800.000000
ExplicitBounds #8: 1000.000000
ExplicitBounds #9: 1400.000000
ExplicitBounds #10: 2000.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 15000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 120000.000000
ExplicitBounds #15: 9223372036854.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 2
Buckets #11, Count: 1
Buckets #12, Count: 0
Buckets #13, Count: 0
Buckets #14, Count: 0
Buckets #15, Count: 0
Buckets #16, Count: 18446744073709551614
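
For illustration only, here is a minimal Go sketch (my own reconstruction, not the actual exporter code) of the arithmetic I suspect is happening: if the +Inf bucket is back-filled as the data point's Count minus the sum of the explicit bucket counts, a Count that lags behind the buckets underflows uint64 and produces exactly the value logged above.

```go
package main

import "fmt"

func main() {
	// Values taken from the HistogramDataPoints log output above.
	var count uint64 = 1
	buckets := []uint64{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0}

	var sum uint64
	for _, b := range buckets {
		sum += b // explicit buckets sum to 3
	}

	// Hypothetical derivation of the +Inf bucket: count - sum wraps around
	// in unsigned arithmetic when count < sum.
	inf := count - sum
	fmt.Println(inf) // 18446744073709551614, i.e. the uint64 representation of -2
}
```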

What version did you use? Version: opentelemetry-collector-contrib@master

What config did you use? Config: (e.g. the yaml config file)

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'atm'
        scrape_interval: 10s
        static_configs:
        - targets: [ "0.0.0.0:8889" ]

  jaeger:
    protocols:
      thrift_http:
        endpoint: "0.0.0.0:14278"

  # Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:65535"

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

  logging:
    loglevel: debug

  jaeger:
    endpoint: "localhost:14250"
    insecure: true

  prometheusremotewrite:
    endpoint: "http://localhost:7201/api/v1/prom/remote/write"
    insecure: true

  otlp/spanmetrics:
    endpoint: "localhost:55677"
    insecure: true

processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
    # default (in ms): [2, 4, 6, 8, 10, 50, 100, 200, 400, 800, 1000, 1400, 2000, 5000, 10_000, 15_000]
    latency_histogram_buckets: [2ms, 6ms, 10ms, 100ms, 250ms, 300ms, 400ms, 800ms, 1s, 1.4s, 2s, 5s, 15s, 30s, 120s]

extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [spanmetrics]
      exporters: [jaeger]
    # The exporter name must match the metrics_exporter name.
    # The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [prometheus, logging]
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite, logging]

Environment OS: macOS; Compiler (if manually compiled): go 1.16

Additional context cc @bogdandrutu

ankitnayan commented 3 years ago

@albertteoh Were you able to make this work? Or did you find any workaround?

albertteoh commented 2 years ago

Hi @ankitnayan, my workaround was to filter out any latencies > 24 hours, which isn't nice but it does the job for my use case at least.

luistilingue commented 2 years ago

@albertteoh I have a similar issue, but using Prometheus.

The spanmetrics processor is creating a bucket with le="9.223372036854775e+12":

latency_bucket{http_status_code="200",operation="/health/xxxxx/health/**",service_name="xxxxx",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="9.223372036854775e+12"} 1

I guess the spanmetrics processor code needs to handle Go number conversion when using float64.

Please have a look at these lines https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/spanmetricsprocessor/processor.go#L140

I've simulated it here and the behavior was the same.

So, when Prometheus reads that "number", it behaves this way. I don't know if this doc can help.
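
For what it's worth, that le value looks like math.MaxInt64 nanoseconds expressed in milliseconds as a float64. A quick sketch, assuming that is roughly how the processor derives its catch-all upper bound (I haven't verified this against the linked code):

```go
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// MaxInt64 nanoseconds converted to milliseconds, which matches the
	// le="9.223372036854775e+12" bound above up to float64 rounding.
	maxDurationMs := float64(math.MaxInt64) / float64(time.Millisecond)
	fmt.Printf("%g\n", maxDurationMs) // prints roughly 9.223372036854776e+12
}
```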

ankitnayan commented 2 years ago

@luistilingue I upgraded to v0.43.0 and it seems to have been fixed. Which version are you using?

luistilingue commented 2 years ago

@ankitnayan I was using 0.40.0 and upgraded my whole stack (OTel Collector to 0.46.0, Prometheus to 2.33.5, and the Java agent to 1.11.1), but the issue still occurs.

balintzs commented 2 years ago

Hello all, I believe this is caused by a bug we found in prometheus that causes the +Inf bucket to be added incorrectly, which in turn results in a negative number when converting cumulative datapoints to delta: https://github.com/prometheus/client_golang/issues/1147

We faced an issue whereby New Relic dropped our datapoints because of this. The issue existed with 0.61.0 but became much worse with 0.62.0. We built a custom image updating github.com/prometheus/client_golang to the SHA (dcea97eee2b3257f34fd3203cb922eedeabb42a6) that contained our fix, and the issue disappeared. (screenshot omitted)

cc @TylerHelmuth

balintzs commented 2 years ago

Actually, looking at it, it is not the same bug, but the same issue: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/translator/prometheusremotewrite/helper.go#L350

The +Inf bucket should have the same value as the total count: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/translator/prometheusremotewrite/helper.go#L306
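
To illustrate that rule, here is a minimal self-contained sketch (not the actual helper.go code) of how cumulative bucket samples should be emitted, with the le="+Inf" sample carrying the data point's Count rather than the running bucket sum:

```go
package main

import "fmt"

// bucketSamples emits cumulative bucket values keyed by their "le" label.
// Per the Prometheus exposition rules, the le="+Inf" sample must equal the
// data point's total Count; if Count and the explicit-bucket sum disagree
// (as in the log output earlier in this issue), deriving +Inf from Count
// keeps the series consistent and avoids negative deltas downstream.
func bucketSamples(bounds []float64, bucketCounts []uint64, count uint64) map[string]uint64 {
	samples := make(map[string]uint64, len(bounds)+1)
	var cumulative uint64
	for i, bound := range bounds {
		cumulative += bucketCounts[i]
		samples[fmt.Sprintf("%g", bound)] = cumulative
	}
	samples["+Inf"] = count // total count, not the running bucket sum
	return samples
}

func main() {
	bounds := []float64{2, 6, 10}
	counts := []uint64{0, 2, 1}
	fmt.Println(bucketSamples(bounds, counts, 3))
	// map[+Inf:3 10:3 2:0 6:2]
}
```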

github-actions[bot] commented 2 years ago

Pinging code owners: @Aneurysm9. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.