grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Issue while getting traces with Jaeger + tempo + AWS s3 set up #1980

Closed vaish1707 closed 1 year ago

vaish1707 commented 1 year ago

We have been working on setting up https://github.com/grafana/tempo with Amazon Managed Grafana. The purpose of using Tempo is to store Jaeger traces in S3 and to visualise the traces in both Jaeger and Grafana. I have installed Tempo in an AWS EKS cluster with S3 as the backend storage, and I'm facing some issues with Tempo.

While searching for traces for a particular service from Jaeger, the request is sent to the tempo-query service to fetch the traces from S3. This request returns all the traces present rather than only the traces for that particular service. In addition, the Tempo API returns a 404 for traces that are still present in S3 once roughly 2-3 hours have passed since the trace was produced and stored. Everything works from the time a trace is produced until about 2-3 hours later; after that, the Tempo API returns a 404 even though the traces are present in S3.
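
As a sketch of how the 404 can be checked directly against Tempo (bypassing tempo-query), using the query-frontend service name and port from the config below; the trace ID is a placeholder:

# Fetch a known trace by ID straight from Tempo's query-frontend; a 404 here means Tempo itself cannot find it in the backend
curl -v "http://tempo-query-frontend:3100/api/traces/<trace-id>"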

To Reproduce
Steps to reproduce the behavior: deploy tempo-distributed in a Kubernetes cluster using https://github.com/grafana/helm-charts/tree/main/charts/tempo-distributed with the following configuration:

  traces:
    jaeger:
      grpc:
        enabled: true
      thriftHttp:
        enabled: true
  minio:
    enabled: false
  queryFrontend:
    query:
      enabled: true

  tempo:
    securityContext:
      allowPrivilegeEscalation: true
      readOnlyRootFilesystem: false
  config: |
    query_frontend:
      search:
        max_duration: 0
    multitenancy_enabled: false
    search_enabled: true
    compactor:
      compaction:
        block_retention: 1440h
    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_compact:
              endpoint: 0.0.0.0:6831
            thrift_binary:
              endpoint: 0.0.0.0:6832
            thrift_http:
              endpoint: 0.0.0.0:14268
            grpc:
              endpoint: 0.0.0.0:14250
        otlp:
          protocols:
            http:
              endpoint: 0.0.0.0:55681
            grpc:
              endpoint: 0.0.0.0:4317
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
    ingester:
      lifecycler:
        ring:
          replication_factor: 1
          kvstore:
            store: memberlist
        tokens_file_path: /var/tempo/tokens.json
    memberlist:
      abort_if_cluster_join_fails: false
      join_members:
        - tempo-gossip-ring
    overrides:
      max_search_bytes_per_trace: 0
      per_tenant_override_config: /conf/overrides.yaml
    server:
      http_listen_port: 3100
      log_level: info
      grpc_server_max_recv_msg_size: 4.194304e+06
      grpc_server_max_send_msg_size: 4.194304e+06
    storage:
      trace:
        backend: s3
        s3:
          bucket:                    # how to store data in s3
          endpoint: s3.dualstack.us-east-1.amazonaws.com
          access_key: <accesskey>
          secret_key: <secret key>
        blocklist_poll: 5m
        wal:
          path: /var/tempo/wal
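
A rough sanity check that blocks are actually being written to the bucket (the bucket name is a placeholder; with multitenancy disabled Tempo generally writes under a single-tenant prefix, which can be confirmed by listing the bucket root):

# Each flushed Tempo block appears as a UUID-named prefix under the tenant folder
aws s3 ls s3://<bucket-name>/single-tenant/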

Expected behavior
Tempo should successfully return traces when queried from Jaeger and should serve traces that have already been stored in S3.

Environment:

joe-elliott commented 1 year ago

Thanks for filing this issue. A few questions to better understand what may be going on here:

vaish1707 commented 1 year ago

@joe-elliott, I have formatted the config as you asked.

joe-elliott commented 1 year ago

I think I'm starting to understand better. There are two issues here.

Query for service returning traces that don't match

Can you share the queries that are being executed and returning all traces rather than just the expected ones? The query-frontend should do some request/response logging like this:

level=info ts=2023-01-11T15:52:59.641178922Z caller=handler.go:124 tenant=vulture-tenant method=GET traceID=7feabcfa5b556718 url="/tempo/api/search?tags=vulture-2%3DtNPpslYwNuTYQ&start=1673097760&end=1673101360" duration=119.048467ms response_size=252 status=200
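
One possible way to pull those lines from the query-frontend pods (the namespace and label selector are assumptions based on the tempo-distributed chart's usual labels; adjust to match your deployment):

kubectl logs -n <namespace> -l app.kubernetes.io/component=query-frontend --tail=500 | grep handler.go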

Also, to be clear: if you search Tempo for service.name="foo", it will return all traces that have a span originating from that service (not just the traces rooted in that service).
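
For illustration, a tags-based search of that form could look like this (the service name, host, and time range are placeholders; the endpoint and parameters follow the log line above):

curl -G "http://tempo-query-frontend:3100/api/search" --data-urlencode "tags=service.name=foo" --data-urlencode "start=<start-epoch>" --data-urlencode "end=<end-epoch>"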

Traces disappearing after 2-3 hours

The compactor is the component responsible for clearing out old blocks. Can you check the compactor logs to see if it's deleting older blocks? Also check the /status/config endpoint of the compactors and confirm that the expected retention is set:

compactor:
  compaction:
    block_retention: ??

Can you double check s3 to confirm that the older blocks exist and some other process is not removing them?
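
For reference, a sketch of how those checks might be run (pod name and namespace are placeholders; /status/config is the endpoint mentioned above):

# In one terminal: forward the compactor's HTTP port
kubectl -n <namespace> port-forward <tempo-compactor-pod> 3100:3100

# In another terminal: read the effective config and confirm the retention value actually in use
curl -s http://localhost:3100/status/config | grep block_retention

# Look for deletion/retention activity in the compactor logs
kubectl -n <namespace> logs <tempo-compactor-pod> | grep -i -E "retention|delet"

The aws s3 ls check shown earlier can be used to confirm whether the older block folders are still in the bucket.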

For both issues: are there any unexpected errors or warnings in your logs that might give us a clue as to what is happening?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply the keepalive label to exempt this issue.