grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Missing values in TraceQL metrics #4176

Closed: koraven closed this 3 weeks ago

koraven commented 1 month ago

Describe the bug
We are currently testing TraceQL metrics, as it's one of the key features our developers need, and I noticed some strange behavior.

We have missing values with TraceQL metrics. Sometimes there are no values from any metrics function, even when I know for certain there are spans (verified by running the same search without a metrics function). It looks like this (last-hour timeframe): [screenshot: metrics panel showing only a short line of values]. It does not matter whether it's rate() or quantile_over_time().

If I remove 'span.db.system="mongodb"' from the condition, I see the full hour of metrics. I am sure there is data where the metrics are empty, because if I select a timeframe with no metric values and just run a search, I easily find 500 traces (each containing multiple spans matching the condition). [screenshot: search results for the same timeframe]
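
For clarity, the two queries being compared look roughly like this (a sketch; any extra filters in the real query are omitted):

Metrics query (mostly empty):
  { span.db.system = "mongodb" } | rate()

Plain search over the same timeframe (finds plenty of spans):
  { span.db.system = "mongodb" }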

To Reproduce
Hard to say. I usually see broken metrics when querying metrics from spans of "client" or "consumer" kind. Not sure if that's related; maybe there are just fewer of them.

Expected behavior
If there are spans, I want to see metrics generated from them.

Environment:

Additional Context
I don't see anything suspicious in the logs. However, the log level is set to "warn"; I'll retry with "info".

It seems not to matter whether the data is still on the metrics-generator or already on the backend. When I run queries over data that is definitely in S3 it looks a bit better, but the beginning and the last ~20 minutes of the hour are still usually empty.

Here are parts of my tempo.yaml:

metrics_generator:
  metrics_ingestion_time_range_slack: 60s
  service_graphs: ...
  span_metrics: ...
  local_blocks:
    complete_block_timeout: 1h
    flush_check_period: 30s
    trace_idle_period: 30s
    flush_to_storage: true
  registry:
    collection_interval: 30s
    external_labels:
      prometheus: tempo-tracing
    stale_duration: 5m
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /var/tempo/wal
    remote_write:
      - url: http://prometheus-tempo-tracing.tracing.svc.cluster.local/api/v1/write
    remote_write_flush_deadline: 1m
    wal: null
  traces_storage:
    path: /var/tempo/traces
....
query_frontend:
  max_outstanding_per_tenant: 2000
  max_retries: 2
  search:
    concurrent_jobs: 100
    target_bytes_per_job: 604857600
  trace_by_id:
    query_shards: 50
  metrics:
    max_duration: 8h
    concurrent_jobs: 100
    target_bytes_per_job: 604857600
    query_backend_after: 30m
...
querier:
  frontend_worker:
    frontend_address: tempo-query-frontend-discovery:9095
    grpc_client_config:
      max_send_msg_size: 20971520
  max_concurrent_queries: 5
  search:
    external_backend: null
    external_endpoints: []
    external_hedge_requests_at: 8s
    external_hedge_requests_up_to: 2
    prefer_self: 10
    query_timeout: 60s
  trace_by_id:
    query_timeout: 20s

mdisibio commented 1 month ago

Hi, thanks for the detailed information. Agreed, it does seem like metrics are missing, since the screenshot of search results shows spans within the same time range as the metrics query. You have already covered the basics, including enabling the local-blocks processor and flush_to_storage.

The next thing to check is the in-memory queue between the distributors and the generators. If the queue is full then the distributors will not be able to send incoming spans to the generators, and therefore they will not get flushed to storage. It would show up as missing metrics, but search still works because it reads the data flushed by the ingesters.

Please take a look at the following metrics and see if there are any discards:

tempo_distributor_forwarder_pushes_failures_total
tempo_distributor_metrics_generator_pushes_failures_total

Let's check these 2 metrics and see if those are happening.
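
For example, a quick way to check them in Prometheus (a sketch; adjust the rate window to taste):

sum(rate(tempo_distributor_forwarder_pushes_failures_total[5m]))
sum(rate(tempo_distributor_metrics_generator_pushes_failures_total[5m]))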

koraven commented 1 month ago

tempo_distributor_forwarder_pushes_failures_total has been 0 for the last week. tempo_distributor_metrics_generator_pushes_failures_total shows rare spikes lasting 5-10 seconds, about once a day.

I also checked resource usage in general; all fine. I would also expect a similar picture for all spans, but it looks like the issue affects only some kinds of spans: I haven't noticed any issues with server spans yet, while other kinds are affected.

The screenshot was taken today (18.10) for data that is 3 days old. That data is definitely in the S3 bucket, so there should not be any queue issues. Still, the server kind is fine while 'consumer' is 0. [screenshot: server-kind metrics returning values while consumer-kind metrics are 0]

I also checked what "metrics from spans" generates for Prometheus; this query shows reasonable values:

sum(rate(traces_spanmetrics_calls_total{service="<service>",span_kind="SPAN_KIND_CONSUMER"}[1m]))
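
For comparison, the TraceQL metrics query I would expect to roughly correspond to that looks like this (a sketch; the service name is a placeholder):

{ resource.service.name = "<service>" && kind = consumer } | rate()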

koraven commented 1 month ago

I can try to check what's inside the blocks in S3 with pandas or other tools, but I don't know how to find the blocks generated by the metrics-generator and distinguish them from the ingester blocks.

koraven commented 1 month ago

OK, it looks like the issue was the "filter_server_spans" setting. I set it to false and now it works as expected.
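
For anyone hitting the same thing, this is roughly the change, shown against the local_blocks block from my config above (the default is true, which keeps only parent spans or spans with the SpanKind of server):

metrics_generator:
  local_blocks:
    filter_server_spans: false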

As I understand it now, the rare spikes I saw for non-server spans came from parent spans of non-server kind.

I hope I am the only one who was confused by the "only parent spans" part. I knew about the 'filter_server_spans' setting but was sure it wasn't my problem, because I saw non-server spans in the output. In the end, it turned out I had missed the part about "only parent spans or spans with the SpanKind of server".