grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

distributor not pushing traces to ingester #4102

Closed MrMegaMango closed 1 month ago

MrMegaMango commented 1 month ago

Describe the bug

The distributor keeps logging the following error and traces never reach the ingesters:

msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"

kubectl get pods:

tempo-distributed-compactor-569cd575bd-dwmp2        1/1     Running                  0               11m
tempo-distributed-distributor-76d5f9d59f-rhnf8      1/1     Running                  0               11m
tempo-distributed-ingester-0                        1/1     Running                  0               9m19s
tempo-distributed-ingester-1                        1/1     Running                  0               10m
tempo-distributed-ingester-2                        1/1     Running                  0               11m
tempo-distributed-memcached-0                       1/1     Running                  0               6d23h
tempo-distributed-querier-69788c648d-qq599          1/1     Running                  0               11m
tempo-distributed-query-frontend-6c976d8854-h28wk   1/1     Running                  0               11m

➜ ~ kubectl logs tempo-distributed-distributor-76d5f9d59f-rhnf8 -n monitoring
level=warn ts=2024-09-20T12:47:05.762701192Z caller=main.go:131 msg="-- CONFIGURATION WARNINGS --"
level=warn ts=2024-09-20T12:47:05.76324662Z caller=main.go:137 msg="Local backend will not correctly retrieve traces with a distributed deployment unless all components have access to the same disk. You should probably be using object storage as a backend."
level=info ts=2024-09-20T12:47:05.76327724Z caller=main.go:226 msg="initialising OpenTracing tracer"
level=info ts=2024-09-20T12:47:05.776481322Z caller=main.go:119 msg="Starting Tempo" version="(version=2.6.0, branch=HEAD, revision=e85bbc57d)"
level=info ts=2024-09-20T12:47:05.780640686Z caller=server.go:249 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
level=info ts=2024-09-20T12:47:05.785880792Z caller=memberlist_client.go:439 msg="Using memberlist cluster label and node name" cluster_label= node=tempo-distributed-distributor-76d5f9d59f-rhnf8-f088e6ee
ts=2024-09-20T12:47:05Z level=info msg="OTel Shim Logger Initialized" component=tempo
level=info ts=2024-09-20T12:47:05.796523206Z caller=module_service.go:82 msg=starting module=internal-server
level=info ts=2024-09-20T12:47:05.7969655Z caller=module_service.go:82 msg=starting module=server
level=info ts=2024-09-20T12:47:05.79823156Z caller=module_service.go:82 msg=starting module=memberlist-kv
level=info ts=2024-09-20T12:47:05.798699675Z caller=module_service.go:82 msg=starting module=overrides
level=info ts=2024-09-20T12:47:05.798964436Z caller=module_service.go:82 msg=starting module=usage-report
level=info ts=2024-09-20T12:47:05.79923718Z caller=module_service.go:82 msg=starting module=metrics-generator-ring
level=info ts=2024-09-20T12:47:05.799286763Z caller=module_service.go:82 msg=starting module=ring
level=info ts=2024-09-20T12:47:05.800736769Z caller=ring.go:297 msg="ring doesn't exist in KV store yet"
level=info ts=2024-09-20T12:47:05.802133956Z caller=ring.go:297 msg="ring doesn't exist in KV store yet"
level=info ts=2024-09-20T12:47:05.802236536Z caller=module_service.go:82 msg=starting module=distributor
ts=2024-09-20T12:47:05Z level=warn msg="Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning." component=tempo documentation=https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks feature gate ID=component.UseLocalHostAsDefaultHost
ts=2024-09-20T12:47:05Z level=info msg="Starting GRPC server" component=tempo endpoint=0.0.0.0:4317
ts=2024-09-20T12:47:05Z level=warn msg="Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning." component=tempo documentation=https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks feature gate ID=component.UseLocalHostAsDefaultHost
ts=2024-09-20T12:47:05Z level=info msg="Starting HTTP server" component=tempo endpoint=0.0.0.0:4318
level=info ts=2024-09-20T12:47:05.804175132Z caller=app.go:208 msg="Tempo started"
level=error ts=2024-09-20T12:48:28.695047491Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:48:29.736100546Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:48:31.811067583Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:48:35.842725529Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:48:43.898817836Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:48:59.944688999Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:49:31.99313536Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:49:33.008964589Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:49:35.03315418Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:49:39.078785127Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:49:47.146666092Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:50:03.202087895Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:50:35.249169094Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
level=error ts=2024-09-20T12:50:36.281291136Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"

My config:

{{ if .Values.tempo.enabled }}
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tempo
  namespace: argocd 
spec:
  project: default
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: tempo-distributed
    targetRevision: 1.18.0 
    helm:
      releaseName: tempo-distributed
      values: |
        service:
          type: ClusterIP
        config: |
          storage:
            trace:
              backend: local 
              local:
                path: /var/tempo/traces
          querier:
            frontend_worker:
                frontend_address: tempo-distributed-query-frontend.monitoring.svc.cluster.local:9095
          server:
            http_listen_port: 3100
          distributor:
            ring:
              kvstore:
                store: memberlist
            receivers:
              otlp:
                protocols:
                  grpc: 
                  http:
          ingester:
            lifecycler:
              ring:
                replication_factor: 1
                kvstore:
                  store: memberlist

        traces:
          otlp:
            grpc: 
              enabled: true
            http: 
              enabled: true
        ingester:
          replicas: 3
          config:
            replication_factor: 1

      # optional: use storage config to use other blob storage like s3
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: monitoring  
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
{{ end }}
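
(Not from the thread, but a quick way to verify what the pods actually received from this Argo/Helm values block is to dump the rendered Tempo config; the ConfigMap name and mount path below are assumptions based on tempo-distributed chart defaults, not confirmed in this issue.)

# show the config the chart rendered (ConfigMap name is an assumption)
kubectl -n monitoring get configmap tempo-distributed-config -o yaml
# or read it from a running pod (mount path is an assumption)
kubectl -n monitoring exec deploy/tempo-distributed-distributor -- cat /conf/tempo.yaml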
mapno commented 1 month ago

Hi! It looks like the ingesters are not properly registering with the ring. You can check this at /ingester/ring (docs).
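
(A minimal way to do that check, assuming the chart created a distributor Service named tempo-distributed-distributor and the HTTP port 3100 from the config above, is to port-forward the distributor and fetch the ring status page:)

kubectl -n monitoring port-forward svc/tempo-distributed-distributor 3100:3100
curl http://localhost:3100/ingester/ring
# an empty instance list here is consistent with "DoBatch: InstancesCount <= 0"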

I believe you're missing a memberlist section that configures how Tempo joins the ring:

  memberlist:
    abort_if_cluster_join_fails: false
    join_members:
      - {{ include "tempo.fullname" . }}-memberlist

You can see an example here: https://github.com/grafana/helm-charts/tree/main/charts/tempo-distributed#example-configuration-using-s3-for-storage

MrMegaMango commented 1 month ago

Hi @mapno, thanks for the response. I figured as much and added:

          memberlist:
            join_members:
              - dns+tempo-distributed-gossip-ring:7946

Now it should be OK.

I only opened the issue because I thought it would work out of the box a bit better, without more investigating.
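
(To confirm the fix, the gossip ring can be inspected the same way as the ingester ring; Tempo exposes a memberlist status page at /memberlist. Service name and port are assumed as above.)

kubectl -n monitoring port-forward svc/tempo-distributed-distributor 3100:3100
curl http://localhost:3100/memberlist
# every distributor and ingester pod should appear as a cluster member
curl http://localhost:3100/ingester/ring
# the three ingesters should now be listed as ACTIVE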