grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

generator push to prometheus pushgateway always fails "HTTP status 400 Bad Request: snappy: corrupt input" #3921

Closed: andrewbulin closed this issue 1 week ago

andrewbulin commented 3 months ago

Describe the bug

The Tempo metrics-generator always fails to push to the Prometheus Pushgateway, with a snappy error:

ts=2024-07-30T08:40:27.386622385Z caller=dedupe.go:112 tenant=single-tenant component=remote level=error remote_name=f56174 url=http://prometheus-for-amp-prometheus-pushgateway.prometheus.svc.cluster.local:9091/metrics/job/tempo-metrics-generator msg="non-recoverable error" count=1099 exemplarCount=0 err="server returned HTTP status 400 Bad Request: snappy: corrupt input"

Any tips or recommendations to help debug this would be appreciated.

To Reproduce

Helm tempo values for metrics-generator:

metricsGenerator:
  enabled: true
  kind: Deployment
  replicas: 1
  terminationGracePeriodSeconds: 300
  persistence:
    enabled: false
  walEmptyDir: {}
  ports:
    - name: grpc
      port: 9095
      service: true
    - name: http-memberlist
      port: 7946
      service: false
    - name: http-metrics
      port: 3100
      service: true
  config:
    registry:
      collection_interval: 15s
      stale_duration: 15m
    storage:
      path: /var/tempo/wal
      wal:
        wal_compression: "snappy"
      remote_write_flush_deadline: 1m
      remote_write:
        - url: "http://prometheus-for-amp-prometheus-pushgateway.prometheus.svc.cluster.local:9091/metrics/job/tempo-metrics-generator"
    traces_storage:
      path: /var/tempo/traces
    metrics_ingestion_time_range_slack: 30s

Expected behavior

Metrics should just push.

I can confirm from inside the cluster (from within the tempo namespace) that a push completes via curl:

echo "some_metric 3.14" | curl --data-binary @- http://prometheus-for-amp-prometheus-pushgateway.prometheus.svc.cluster.local:9091/metrics/job/tempo-metrics-generator

~ $ curl -s http://prometheus-for-amp-prometheus-pushgateway.prometheus.svc.cluster.local:9091/metrics | grep tempo-metrics-generator
push_failure_time_seconds{instance="",job="tempo-metrics-generator"} 0
push_time_seconds{instance="",job="tempo-metrics-generator"} 1.7223294456839218e+09
some_metric{instance="",job="tempo-metrics-generator"} 3.14

Environment:

javiermolinar commented 3 months ago

Hi! I wonder if the "snappy" wal_compression is needed at all. In fact, running it locally, I get the same error with your curl request:

echo "some_metric 3.14" | curl --data-binary @- http://localhost:9090/api/v1/write
snappy: corrupt input

You need to configure Prometheus to accept remote-write with this flag:

--web.enable-remote-write-receiver

and

remote_write:
  - url: "http://remote-storage-endpoint/api/v1/write"

You can take a look at the example here: https://github.com/grafana/tempo/blob/d36cc9f22714b46cd6c31123a7ee1f48b464cfb7/example/docker-compose/shared/tempo.yaml#L35

and here: https://github.com/grafana/tempo/blob/d36cc9f22714b46cd6c31123a7ee1f48b464cfb7/example/docker-compose/local/docker-compose.yaml#L39
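
For a concrete picture, here is a minimal sketch of a Prometheus service with the receiver enabled (the image tag, volume path, and config file location are illustrative; the linked docker-compose example is the authoritative version):

services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus.yaml
      # This is the flag that lets Tempo's remote_write POST to /api/v1/write:
      - --web.enable-remote-write-receiver
    volumes:
      - ./prometheus.yaml:/etc/prometheus.yaml
    ports:
      - "9090:9090"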

andrewbulin commented 2 months ago

Thanks for replying! ^_^

> Hi! I wonder if the "snappy" wal_compression is needed at all. In fact, running it locally, I get the same error with your curl request

Yeah, I agree; this setting seems to neither help nor hurt either way. We can remove it if that simplifies testing.

> You need to configure Prometheus to accept remote-write with this flag:

I saw that, but I think the critical detail of a "pushgateway" is missing there. For example, the defaults in example/docker-compose/distributed also just work, as you describe. I think I have a better reproduction matching my usage if you make some small changes to the example/docker-compose/distributed/ directory:

{ cat << EOF
diff --git a/example/docker-compose/distributed/docker-compose.yaml b/example/docker-compose/distributed/docker-compose.yaml
index abf51e32a..c3d5bcc91 100644
--- a/example/docker-compose/distributed/docker-compose.yaml
+++ b/example/docker-compose/distributed/docker-compose.yaml
@@ -146,6 +146,11 @@ services:
     ports:
       - "9090:9090"

+  pushgateway:
+    image: prom/pushgateway:latest
+    ports:
+      - "9091:9091"
+
   grafana:
     image: grafana/grafana:11.0.0
     volumes:
diff --git a/example/docker-compose/distributed/prometheus.yaml b/example/docker-compose/distributed/prometheus.yaml
index 439e48ce6..6c7e28f70 100644
--- a/example/docker-compose/distributed/prometheus.yaml
+++ b/example/docker-compose/distributed/prometheus.yaml
@@ -17,3 +17,7 @@ scrape_configs:
         - 'querier:3200'
         - 'query-frontend:3200'
         - 'metrics-generator:3200'
+  - job_name: 'prometheus-pushgateway'
+    static_configs:
+      - targets: [ 'pushgateway:9091' ]
+
diff --git a/example/docker-compose/distributed/tempo-distributed.yaml b/example/docker-compose/distributed/tempo-distributed.yaml
index d9134ebdf..db4b58f42 100644
--- a/example/docker-compose/distributed/tempo-distributed.yaml
+++ b/example/docker-compose/distributed/tempo-distributed.yaml
@@ -43,7 +43,8 @@ metrics_generator:
   storage:
     path: /var/tempo/generator/wal
     remote_write:
-      - url: http://prometheus:9090/api/v1/write
+      # - url: http://prometheus:9090/api/v1/write
+      - url: http://pushgateway:9091/metrics/job/tempo/instance/metrics-generator
         send_exemplars: true

 storage:
@@ -63,4 +64,4 @@ storage:
 overrides:
   defaults:
     metrics_generator:
-      processors: ['service-graphs', 'span-metrics']
\ No newline at end of file
+      processors: ['service-graphs', 'span-metrics']
EOF
} | git apply

With these changes, the metrics-generator logs should show the same snappy corrupt-input error. There may be a right way to configure the generator module to work with the Prometheus Pushgateway, but I've yet to find it. 🤔

Maybe it's related to pushed-metric inconsistencies, which can cause 400 errors; reference: https://github.com/prometheus/pushgateway/blob/master/README.md#about-metric-inconsistencies

I'm now also concerned that metrics pushed to a pushgateway are never forgotten until removed: https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway
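
(If it comes to that, the Pushgateway's HTTP API can delete a whole group by hand; a rough sketch using the job/instance labels from the diff above:)

# Removes every metric in the group pushed as job="tempo", instance="metrics-generator".
curl -X DELETE http://pushgateway:9091/metrics/job/tempo/instance/metrics-generator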

I guess my questions now are:

  1. Does Tempo generator properly work with Prometheus pushgateway "out of the box", and if so how best to configure?
  2. If Tempo generator works with a Prometheus pushgateway, can it manage removal of old metrics?
  3. Is it best to even use Tempo generator with a Prometheus pushgateway?

In the meantime, I can confirm that the Tempo metrics-generator pushing directly to the Prometheus server with the remote-write receiver enabled does work, and may be a sufficient workaround for my use case.
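
For reference, a rough sketch of what that workaround looks like in the Helm values above (the Prometheus service hostname is a placeholder for my setup, and Prometheus itself has to run with --web.enable-remote-write-receiver):

metricsGenerator:
  config:
    storage:
      path: /var/tempo/wal
      remote_write_flush_deadline: 1m
      remote_write:
        # Point at Prometheus' remote-write receiver instead of the Pushgateway.
        - url: "http://prometheus-server.prometheus.svc.cluster.local:9090/api/v1/write"
          send_exemplars: true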

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply the keepalive label to exempt this issue.