grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Issue while getting traces with Jaeger + tempo + AWS s3 set up #1980

Closed vaish1707 closed 1 year ago

vaish1707 commented 1 year ago

We have been working on setting up https://github.com/grafana/tempo with Amazon Managed Grafana. The purpose of using Tempo is to store Jaeger traces in S3 and to visualise the traces in both Jaeger and Grafana. I have installed Tempo in an AWS EKS cluster with S3 as the backend storage, and I'm facing some issues with Tempo.

While searching for traces for a particular service from Jaeger, the request is sent to the tempo-query service to fetch the traces from S3. This request returns all the traces present rather than only the traces for that particular service. In addition, the Tempo API returns a 404 for traces that are still present in S3 once roughly 2-3 hours have passed since the trace was produced and stored. Everything works from the time a trace is produced until about 2-3 hours later; after that, the Tempo API returns a 404 even though the traces are present in S3.
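
As a sketch of how the 404 can be checked directly against Tempo (bypassing tempo-query), using the query-frontend service name and port from the config below; the trace ID is a placeholder:

# Fetch a known trace by ID straight from Tempo's query-frontend; a 404 here means Tempo itself cannot find it in the backend
curl -v "http://tempo-query-frontend:3100/api/traces/<trace-id>"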

To Reproduce
Steps to reproduce the behavior: deploy tempo-distributed in a Kubernetes cluster using https://github.com/grafana/helm-charts/tree/main/charts/tempo-distributed with the following configuration:

  traces:
    jaeger:
      grpc:
        enabled: true
      thriftHttp:
        enabled: true
  minio:
    enabled: false
  queryFrontend:
    query:
      enabled: true

  tempo:
    securityContext:
      allowPrivilegeEscalation: true
      readOnlyRootFilesystem: false
  config: |
    query_frontend:
      search:
        max_duration: 0
    multitenancy_enabled: false
    search_enabled: true
    compactor:
      compaction:
        block_retention: 1440h
    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_compact:
              endpoint: 0.0.0.0:6831
            thrift_binary:
              endpoint: 0.0.0.0:6832
            thrift_http:
              endpoint: 0.0.0.0:14268
            grpc:
              endpoint: 0.0.0.0:14250
        otlp:
          protocols:
            http:
              endpoint: 0.0.0.0:55681
            grpc:
              endpoint: 0.0.0.0:4317
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
    ingester:
      lifecycler:
        ring:
          replication_factor: 1
          kvstore:
            store: memberlist
        tokens_file_path: /var/tempo/tokens.json
    memberlist:
      abort_if_cluster_join_fails: false
      join_members:
        - tempo-gossip-ring
    overrides:
      max_search_bytes_per_trace: 0
      per_tenant_override_config: /conf/overrides.yaml
    server:
      http_listen_port: 3100
      log_level: info
      grpc_server_max_recv_msg_size: 4.194304e+06
      grpc_server_max_send_msg_size: 4.194304e+06
    storage:
      trace:
        backend: s3
        s3:
          bucket:                    # how to store data in s3
          endpoint: s3.dualstack.us-east-1.amazonaws.com
          access_key: <accesskey>
          secret_key: <secret key>
        blocklist_poll: 5m
        wal:
          path: /var/tempo/wal
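
A rough sanity check that blocks are actually being written to the bucket (the bucket name is a placeholder; with multitenancy disabled Tempo generally writes under a single-tenant prefix, which can be confirmed by listing the bucket root):

# Each flushed Tempo block appears as a UUID-named prefix under the tenant folder
aws s3 ls s3://<bucket-name>/single-tenant/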

Expected behavior
Tempo should successfully return traces when queried from Jaeger and should serve traces that have already been stored in S3.

Environment:

joe-elliott commented 1 year ago

Thanks for filing this issue. A few questions to better understand what may be going on here:

vaish1707 commented 1 year ago

@joe-elliott, I have formatted the config as you asked.

joe-elliott commented 1 year ago

I think I'm starting to understand better. There are two issues here.

Query for service returning traces that don't match

Can you share the queries that are being executed and returning all traces rather than just the expected ones? The query-frontend should do some request/response logging like this:

level=info ts=2023-01-11T15:52:59.641178922Z caller=handler.go:124 tenant=vulture-tenant method=GET traceID=7feabcfa5b556718 url="/tempo/api/search?tags=vulture-2%3DtNPpslYwNuTYQ&start=1673097760&end=1673101360" duration=119.048467ms response_size=252 status=200
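
One possible way to pull those lines from the query-frontend pods (the namespace and label selector are assumptions based on the tempo-distributed chart's usual labels; adjust to match your deployment):

kubectl logs -n <namespace> -l app.kubernetes.io/component=query-frontend --tail=500 | grep handler.go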

Also, to be clear: if you search Tempo for service.name="foo", it will return all traces that have a span originating from that service (not just the traces rooted in that service).
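
For illustration, a tags-based search of that form could look like this (the service name, host, and time range are placeholders; the endpoint and parameters follow the log line above):

curl -G "http://tempo-query-frontend:3100/api/search" --data-urlencode "tags=service.name=foo" --data-urlencode "start=<start-epoch>" --data-urlencode "end=<end-epoch>"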

Traces disappearing after 2-3 hours

The compactor is the component responsible for clearing out old blocks. Can you check the compactor logs to see if it's deleting older blocks? Also check the /status/config endpoint of the compactors and confirm that the expected retention is set:

compactor:
  compaction:
    block_retention: ??

Can you double check s3 to confirm that the older blocks exist and some other process is not removing them?
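
For reference, a sketch of how those checks might be run (pod name and namespace are placeholders; /status/config is the endpoint mentioned above):

# In one terminal: forward the compactor's HTTP port
kubectl -n <namespace> port-forward <tempo-compactor-pod> 3100:3100

# In another terminal: read the effective config and confirm the retention value actually in use
curl -s http://localhost:3100/status/config | grep block_retention

# Look for deletion/retention activity in the compactor logs
kubectl -n <namespace> logs <tempo-compactor-pod> | grep -i -E "retention|delet"

The aws s3 ls check shown earlier can be used to confirm whether the older block folders are still in the bucket.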

For both issues: are there any unexpected errors or warnings in your logs that might give us a clue as to what is happening?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply the keepalive label to exempt this issue.