grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

failed to get trace with id: a343f1a19e969ea3003b56dfe3802069 Status: 499 status code 499 Body: context canceled #3980

Open wangjinxiang0522 opened 2 months ago

wangjinxiang0522 commented 2 months ago

Describe the question

When I request a large trace, I get a query error.


    trace_by_id:
      query_timeout: 60s

I set the query_timeout to 60 seconds, but it doesn't seem to take effect. The request fails and returns an error in less than 10 seconds.

Which parameter needs to be changed for the trace to load?

Config:

global:
  dnsService: "kube-dns"
  image:
    registry: reg.sprucetec.com

reportingEnabled: false

global_overrides:
  metrics_generator_processors:
    - service-graphs
    - span-metrics

storage:
  trace:
    backend: s3
    s3:
      access_key: '1111'
      secret_key: '2222'
      bucket: 'tempo'
      endpoint: 'minio.apm-minio-cluster.svc.cluster.local:80'
      insecure: true
minio:
  enabled: false

traces:
  otlp:
    grpc:
      enabled: true
      receiverConfig:
        max_recv_msg_size_mib: 100
    http:
      enabled: true

server:
  # 30M
  grpc_server_max_recv_msg_size: 31457280
  grpc_server_max_send_msg_size: 31457280
  http_server_read_timeout: 30s
  http_server_write_timeout: 30s

ingester:
  replicas: 20
  nodeSelector:
    node-type: monitoring
  config:
    replication_factor : 3
    max_block_bytes: 209715200
    max_block_duration: 18m
    complete_block_timeout: 15m
    flush_all_on_shutdown: true
    max_traces_per_user: 80000
  persistence:
    enabled: false
    inMemory: false
    #size: 20Gi
    #storageClass: "csi-rbd-sc"
  extraEnv:
    - name: GOMEMLIMIT
      value: "14745MiB"
    - name: GOMAXPROCS
      value: "5"
  resources:
    limits:
      memory: "16Gi"
      cpu: "5"
    requests:
      memory: "5Gi"
      cpu: "1"

distributor:
  affinity: {}
  replicas: 12
  nodeSelector:
    node-type: monitoring
  extraEnv:
    - name: GOMEMLIMIT
      value: "7372MiB"
    - name: GOMAXPROCS
      value: "4"
  resources:
    limits:
      memory: "8Gi"
      cpu: "4"
    requests:
      memory: "500Mi"
      cpu: "500m"

metricsGenerator:
  affinity: {}
  enabled: true
  replicas: 12
  nodeSelector:
    node-type: monitoring
  config:
    storage:
      remote_write:
        - send_exemplars: true
          url: http://mimir-gateway.monitoring.svc:80/api/v1/write
  persistence:
    enabled: true
    size: 20Gi
    storageClass: "csi-rbd-sc"
  extraEnv:
    - name: GOMEMLIMIT
      value: "15564MiB"
    - name: GOMAXPROCS
      value: "6"
  resources:
    limits:
      memory: "16Gi"
      cpu: "6"
    requests:
      memory: "128Mi"
      cpu: "500m"

compactor:
  replicas: 2
  nodeSelector:
    node-type: monitoring
  config:
    compaction:
      # Duration to retain blocks. The default is 336h (14 days); set to 72h (3 days) here.
      block_retention: 72h
      # Duration to retain blocks that have already been compacted elsewhere. The default is 1 hour.
      compacted_block_retention: 30m
      # Blocks within this time window are compacted together. The default is 1 hour.
      compaction_window: 15m
      # Maximum number of traces in a compacted block. The default is 6 million; set to 60,000 here.
      max_compaction_objects: 60000
      # Maximum compacted block size. The default is 100 GB; set to 50 MB here.
      max_block_bytes: 52428800
      # -- Number of traces to buffer in memory during compaction
      v2_prefetch_traces_count: 1000
      # -- Amount of data to buffer from input blocks
      v2_in_buffer_bytes: 5242880
      # -- Flush data to backend when buffer is this large
      v2_out_buffer_bytes: 20971520
  extraEnv:
    - name: GOMEMLIMIT
      value: "15564MiB"
    - name: GOMAXPROCS
      value: "2"
  resources:
    limits:
      memory: "16Gi"
      cpu: "2"
    requests:
      memory: "2Gi"
      cpu: "500m"

querier:
  affinity: {}
  replicas: 15
  config:
    trace_by_id:
      query_timeout: 60s
    frontend_worker:
      grpc_client_config:
        max_send_msg_size: 31457280
    max_concurrent_queries: 100
    search:
      query_timeout: 60s
      prefer_self: 10
      external_hedge_requests_at: 15s
  nodeSelector:
    node-type: monitoring
  extraEnv:
    - name: GOMEMLIMIT
      value: "9216MiB"
    - name: GOMAXPROCS
      value: "3"
  resources:
    limits:
      memory: "10Gi"
      cpu: "3"
    requests:
      memory: "128Mi"
      cpu: "100m"

queryFrontend:
  affinity: {}
  replicas: 4
  nodeSelector:
    node-type: monitoring
  config:
    max_outstanding_per_tenant: 3000
    max_retries: 2
    search:
      concurrent_jobs: 1500
      target_bytes_per_job: 209715200
    trace_by_id:
      query_shards: 1000
      hedge_requests_at: 15s
      hedge_requests_up_to: 2
  extraEnv:
    - name: GOMEMLIMIT
      value: "3891MiB"
    - name: GOMAXPROCS
      value: "3"
  resources:
    limits:
      memory: "4Gi"
      cpu: "3"
    requests:
      memory: "500Mi"
      cpu: "100m"

gateway:
  affinity: {}
  verboseLogging: false
  enabled: true
  image:
    registry: reg.sprucetec.com
    repository: monitor/docker.io/nginxinc/nginx-unprivileged
    tag: 1.19-alpine
    pullPolicy: IfNotPresent
  replicas: 2
  nodeSelector:
    node-type: monitoring
  extraEnv:
    - name: GOMEMLIMIT
      value: "3891MiB"
    - name: GOMAXPROCS
      value: "2"
  resources:
    limits:
      memory: "4Gi"
      cpu: "2"
    requests:
      memory: "128Mi"
      cpu: "100m"

memcached:
  affinity: {}
  enabled: true
  image:
    registry: reg.sprucetec.com
    repository: monitor/docker.io/memcached
    tag: 1.6.23-alpine
    pullPolicy: IfNotPresent
  extraArgs:
    - -m 16384
    - -I 64m
    - -vv
  replicas: 1
  nodeSelector:
    node-type: monitoring
  extraEnv:
    - name: GOMEMLIMIT
      value: "17408MiB"
    - name: GOMAXPROCS
      value: "2"
  resources:
    limits:
      memory: "17Gi"
      cpu: "2"
    requests:
      memory: "128Mi"
      cpu: "100m"

memcachedExporter:
  enabled: true
  image:
    registry: reg.sprucetec.com
    repository: monitor/docker.io/prom/memcached-exporter
    tag: v0.13.0
    pullPolicy: IfNotPresent

tempo:
  image:
    registry: reg.sprucetec.com
    repository: monitor/docker.io/grafana/tempo
    tag: 2.4.0
    pullPolicy: IfNotPresent
  global_overrides:
    # 5M
    max_bytes_per_trace: 5242880
    ingestion_rate_limit_bytes: 40000000
    max_traces_per_user: 800000

overrides: |
  overrides:
    "*":
      max_traces_per_user: 50000
      ingestion_burst_size_bytes: 100000000
      ingestion_rate_limit_bytes: 50000000

metaMonitoring:
  serviceMonitor:
    enabled: true
    namespace: monitoring
    labels:
      release: prometheus
joe-elliott commented 2 months ago

This controls the amount of time a querier will spend on one subjob of a trace-by-ID lookup:

    trace_by_id:
      query_timeout: 60s

I would adjust the following on your query frontend:

server:
  http_server_read_timeout: 60s
  http_server_write_timeout: 60s
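
In terms of the Helm values posted above, a minimal sketch of the combined change might look like this (the querier timeout was already set; the 60s server timeouts are new and chosen to match it):

server:
  http_server_read_timeout: 60s    # was 30s; raised so the frontend does not cut off long trace-by-ID requests
  http_server_write_timeout: 60s   # was 30s

querier:
  config:
    trace_by_id:
      query_timeout: 60s           # per-subjob timeout on each querier (already set above)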
cancub commented 2 months ago

@joe-elliott, I think the issue is with the Tempo datasource: https://github.com/grafana/grafana/issues/92173

MikkelPorse commented 2 months ago

I'm seeing the same errors on my Grafana 11.0.0 (277ef258d4) / Tempo 2.4.2 (2225623e7).

I'm running a multi-tenant setup and have two separate Grafana datasources against the same Tempo instance. One datasource works flawlessly; the other I can still query, but it stopped showing me traces (TraceID queries) with the "499 context canceled" error a couple of days ago.

I get a lot of warnings in the log of this form, which suggests to me some sort of corruption:

Aug 26 10:56:52 SERVER tempo[882]: level=warn ts=2024-08-26T08:56:52.554109607Z caller=server.go:2136
\ traceID=3efc1d132d124dd9 msg="GET /querier/api/traces/fe7fee0da8a60b662d751c1631ede46e?
\ blockEnd=ac687d6343eb1a1f0000000000000000&blockStart=a72f05397829cbc10000000000000000&mode=blocks (500) 
\ 110.02302ms Response: \"error finding trace by id, blockID: a7f32a3a-39b5-4578-ae2c-29dc2f19cf8a: error retrieving bloom 
\ bloom-0 (TENANTNAME, a7f32a3a-39b5-4578-ae2c-29dc2f19cf8a): does not exist; error finding trace by id, blockID: 
\ a971b7d2-bf8c-48f2-b113-4298b29669c1: error retrieving bloom bloom-2 (TENANTNAME, 
\ a971b7d2-bf8c-48f2-b113-4298b29669c1): does not exist; error finding trace by id, blockID: 
\ a7ac3d7a-fc80-4fe6-8372-62f821fd338b: error retrieving bloom bloom-0 (TENANTNAME, 
\ a7ac3d7a-fc80-4fe6-8372-62f821fd338b): does not exist;

Anyway, if I specify a time range on TraceID queries in the datasource, the errors go away, so it may well be as @cancub says. On the other hand, it works for TENANT2 and only recently stopped working for TENANT, so ...
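
For reference, Tempo's trace-by-ID endpoint accepts an explicit time range, which is roughly what the datasource sends when a range is configured. A hedged sketch (host, trace ID, and epoch values are placeholders):

# start/end are unix epoch seconds; a narrow range limits which blocks are searched
curl -G -s http://<tempo-gateway>/api/traces/<trace-id> \
  --data-urlencode "start=1724630400" \
  --data-urlencode "end=1724634000"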

tomshabtay commented 2 months ago

I'm having the same issue, but I could also reproduce it outside of Grafana by port-forwarding one of the Tempo pods and sending an API request:

$ curl -G -s http://localhost:3100/api/traces/679e1a61c8106b4d9ecc38c0013a38a
context canceled

Pod logs:

...
level=info ts=2024-08-25T13:04:40.397995757Z caller=tempodb.go:335 org_id=single-tenant msg="searching for trace in block" findTraceID=0679e1a61c8106b4d9ecc38c0013a38a block=b5db443a-e104-4020-9e6b-6e7aad1987d4 found=false
level=info ts=2024-08-25T13:04:40.398263033Z caller=tempodb.go:335 org_id=single-tenant msg="searching for trace in block" findTraceID=0679e1a61c8106b4d9ecc38c0013a38a block=b6881e63-b25a-46ec-a5bd-6f35eb82c8fc found=false
level=info ts=2024-08-25T13:04:40.409949046Z caller=handler.go:109 tenant=single-tenant method=GET traceID=5db3519ec2e3de40 url=/api/traces/679e1a61c8106b4d9ecc38c0013a38a duration=1.971874934s status=500 err="rpc error: code = Code(499) desc = context canceled" response_size=0

I tried using tempo-cli to check whether the data is corrupted, but it looks fine:

go run . query trace-summary 679e1a61c8106b4d9ecc38c0013a38a single-tenant --backend="s3" --s3-endpoint=... --bucket=...
Number of blocks: 2
Span count: 14
Trace size: 7052 B
Trace duration: 4 seconds
Root service name: email 
Root span info: ...
joe-elliott commented 2 months ago

499 is an invalid status code, and Tempo should never return it. I believe we have fixed this in 2.6. If someone is able to test, we would appreciate it.

@MikkelPorse You have corrupt or partial blocks in your backend.

@tomshabtay In your case I believe the trace ID lookup is simply timing out. How long before context canceled is returned? Do you have any timeouts configured?
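
One quick way to check how long the request takes before failing (assuming the same port-forward as in the reproduction above) is to have curl report the status code and total time:

curl -G -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  http://localhost:3100/api/traces/679e1a61c8106b4d9ecc38c0013a38a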

tomshabtay commented 2 months ago

@joe-elliott Thanks for your reply. We are currently running version 2.4; I will let you know if updating to 2.6 fixes it. The context canceled error returned in 1.97 seconds in the example I shared.

The querier configuration is:

querier:
  frontend_worker:
    frontend_address: tempo-headless.kube-system.svc.cluster.local:9095
  trace_by_id:
    query_timeout: 60s
  search:
    query_timeout: 2m
    external_hedge_requests_at: 500ms
    external_hedge_requests_up_to: 4
joe-elliott commented 2 months ago

what is your query frontend config?

tomshabtay commented 2 months ago

what is your query frontend config?

query_frontend:
  trace_by_id:
    query_shards: 100
  max_retries: 10
  search:
    max_duration: 0
    query_backend_after: 20ms
    query_ingesters_until: 20ms
    target_bytes_per_job: 419430400
cancub commented 2 months ago

@joe-elliott, I think the issue is with the Tempo datasource: grafana/grafana#92173

In my case it was actually a bad configuration. We had disabled traceQuery in the Tempo datasource in Grafana 10.0.3 due to the time shift error in the datasource. That error was resolved, but we forgot to re-enable traceQuery.
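
For anyone hitting the same thing, a hedged sketch of what re-enabling the time-shifted trace query might look like in a provisioned Tempo datasource (field names are assumed from Grafana's datasource provisioning docs; the URL and shift values are placeholders):

apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo-gateway.monitoring.svc:80
    jsonData:
      traceQuery:
        # intended to attach a time window around trace-by-ID queries
        # instead of searching the full block range
        timeShiftEnabled: true
        spanStartTimeShift: "1h"
        spanEndTimeShift: "-1h"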

joe-elliott commented 2 months ago

query_frontend:
  max_retries: 10
  search:
    query_backend_after: 20ms
    query_ingesters_until: 20ms

I don't know if any of these would cause your issue, but they are all concerning to me. 10 is a lot of retries. We use 2 internally.

The two query_... duration parameters should be left alone. They need to be set based on your polling cycle and complete block timeout to work correctly.
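
Concretely, a minimal sketch of what that suggestion might look like against the config quoted above (max_retries lowered to 2; the two duration parameters are simply omitted so the defaults apply):

query_frontend:
  max_retries: 2
  trace_by_id:
    query_shards: 100
  search:
    target_bytes_per_job: 419430400
    # query_backend_after / query_ingesters_until omitted on purpose: per the
    # comment above, they only work correctly when set relative to the polling
    # cycle and complete_block_timeout, so the defaults are the safer choice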

DerkieH commented 2 weeks ago

Does anyone know if Tempo version 2.6 fixes this bug? On my side, this bug only occurs on one Tempo datasource.

joe-elliott commented 2 weeks ago

In 2.6 this returns a 500, which is better but still not great.

In 2.7 we will add a new traces v2 endpoint that returns a partial trace along with a warning message when a trace is too large. We also have a PR up to return a 4xx instead of a 5xx when the trace is too large on the /api/traces endpoint.

https://grafana.com/docs/tempo/next/api_docs/#query-v2
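
For reference, a hedged sketch of calling the v2 endpoint described in the linked docs (host, port, and trace ID are placeholders):

curl -G -s http://localhost:3100/api/v2/traces/<trace-id>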