grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/

Missing values in TraceQL metrics #4176


koraven commented 4 days ago

Describe the bug
We are currently testing TraceQL metrics, as it's one of the key features our developers need, and I noticed strange behavior.

We have missing values with TraceQL metrics. Sometimes there are no values from any metrics function at all, even when I know for certain there are spans (verified by running a search without metrics functions). In the last-hour timeframe it looks like this (screenshot): just a short line. It does not matter whether the function is rate() or quantile_over_time().

If I remove 'span.db.system="mongodb"' from the condition, I see the full hour of metrics. I am sure there is data where the metrics are empty: if I select a timeframe with no metric values and just run a search, I easily find 500 traces, each containing multiple spans matching the condition (screenshot).
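
For reference, a hedged sketch of the kind of queries described above. The span.db.system="mongodb" filter, rate(), and quantile_over_time() come from this report; the full query text is not shown, so treat these as illustrative only. The plain search that finds the traces:

  { span.db.system = "mongodb" }

and the same filter as a metrics query, which is where values go missing:

  { span.db.system = "mongodb" } | rate()
  { span.db.system = "mongodb" } | quantile_over_time(duration, .9)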

To Reproduce
Hard to say. I usually see broken metrics when querying spans of "client" or "consumer" kind. Not sure if that's related; maybe there are just fewer of them.

Expected behavior
If there are spans, I want to see metrics generated from them.

Environment:

Additional Context
I don't see anything suspicious in the logs. However, my log level is "warn"; I'll try with "info".

It does not seem to matter whether the data is on the metrics-generator or in the backend. When I run queries for data that is definitely in S3 it looks a bit better, but usually the beginning and the last 20 minutes of the hour are still empty.

Here are parts of my tempo.yaml:

metrics_generator:
  metrics_ingestion_time_range_slack: 60s
  service_graphs: ...
  span_metrics: ...
  local_blocks:
    complete_block_timeout: 1h
    flush_check_period: 30s
    trace_idle_period: 30s
    flush_to_storage: true
  registry:
    collection_interval: 30s
    external_labels:
      prometheus: tempo-tracing
    stale_duration: 5m
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /var/tempo/wal
    remote_write:
      - url: http://prometheus-tempo-tracing.tracing.svc.cluster.local/api/v1/write
    remote_write_flush_deadline: 1m
    wal: null
  traces_storage:
    path: /var/tempo/traces
...
query_frontend:
  max_outstanding_per_tenant: 2000
  max_retries: 2
  search:
    concurrent_jobs: 100
    target_bytes_per_job: 604857600
  trace_by_id:
    query_shards: 50
  metrics:
    max_duration: 8h
    concurrent_jobs: 100
    target_bytes_per_job: 604857600
    query_backend_after: 30m
...
querier:
  frontend_worker:
    frontend_address: tempo-query-frontend-discovery:9095
    grpc_client_config:
      max_send_msg_size: 20971520
  max_concurrent_queries: 5
  search:
    external_backend: null
    external_endpoints: []
    external_hedge_requests_at: 8s
    external_hedge_requests_up_to: 2
    prefer_self: 10
    query_timeout: 60s
  trace_by_id:
    query_timeout: 20s

mdisibio commented 1 day ago

Hi, thanks for the detailed information. I agree it does seem like metrics are missing, since the screenshot of search results shows spans within the same time range as the metrics query. You have already covered the basics, including enabling the local-blocks processor and flush_to_storage.

The next thing to check is the in-memory queue between the distributors and the generators. If the queue is full, the distributors will not be able to send incoming spans to the generators, and therefore those spans will never get flushed to storage. That would show up as missing metrics, while search still works because it reads the data flushed by the ingesters.
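
If that queue does turn out to be the bottleneck, it can be resized per tenant. A minimal sketch, assuming the flat overrides format; the option names below are an assumption on my part, so verify them against the configuration docs for your Tempo release:

overrides:
  # Queue between the distributors and the metrics-generators
  # (option names assumed; check your Tempo version's docs)
  metrics_generator_forwarder_queue_size: 200
  metrics_generator_forwarder_workers: 4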

Please take a look at the following metrics and see if there are any discards:

Let's check these two metrics and see if any discards are happening.
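
As a hedged illustration of the kind of check, a Prometheus query over the distributor's discard counter (tempo_discarded_spans_total, which carries a reason label) looks like this; the 5m window is arbitrary:

  sum by (reason) (rate(tempo_discarded_spans_total[5m]))

A non-zero rate for any reason indicates spans being dropped somewhere on the write path.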