grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

How to Create a Prometheus Alert for Missing Traces for a Specific Component in Tempo? #4322

Open rajushrajan opened 1 week ago

rajushrajan commented 1 week ago

Hi everyone,

I’m working on a Prometheus alert to trigger when traces are missing for any component in Tempo. Currently, I have the following query, which triggers an alert when there are no traces available for a specific time window (e.g., 5 minutes):

sum by (cluster, namespace) (avg_over_time(tempo_ingester_live_traces[5m])) == 0

This works well for alerting when no traces are ingested for the entire system (across all components) within the specified time window. However, it only detects a global outage.

How can I modify the query so that the alert fires when traces are missing for any individual component within a cluster or namespace? I want the query to evaluate each component separately, rather than checking globally or for one hard-coded component.

I am using Tempo for trace ingestion and Prometheus for monitoring. The metric I’m working with is tempo_ingester_live_traces, which is labeled by component, namespace, and cluster.
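
For reference, a minimal sketch of what a per-component variant could look like. It assumes the metric really carries a `component` label as described above; the 1h lookback is an arbitrary illustrative value. The first expression keeps `component` in the grouping so each component is evaluated on its own. The second covers the case where a component's series disappears entirely, which a `== 0` comparison cannot catch because there is no sample left to compare.

```promql
# One result per component whose 5m average of live traces is zero.
sum by (cluster, namespace, component) (
  avg_over_time(tempo_ingester_live_traces[5m])
) == 0

# Components that had a series one hour ago but have none now.
count by (cluster, namespace, component) (tempo_ingester_live_traces offset 1h)
unless
count by (cluster, namespace, component) (tempo_ingester_live_traces)
```

The two expressions would likely be separate alert rules, or be joined with `or`, depending on how the alerts should be routed.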

javiermolinar commented 1 week ago

Hi,

I believe the span metrics from the Metrics Generator can help you achieve what you want:

https://grafana.com/docs/tempo/latest/metrics-generator/span_metrics/

These metrics carry additional labels derived from the trace data, for instance the name of the service that generated the span. You can even define custom labels.
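
As a rough sketch of how span metrics could drive such an alert: the query below assumes the metrics generator is enabled with the span metrics processor, that its output is remote-written to the Prometheus instance evaluating the rule, and that the default `traces_spanmetrics_calls_total` counter with its `service` label is in use. The 1h and 5m windows are illustrative, not recommendations.

```promql
# Services that produced spans at some point in the last hour
# but produced none in the last 5 minutes.
sum by (service) (increase(traces_spanmetrics_calls_total[1h])) > 0
unless
sum by (service) (increase(traces_spanmetrics_calls_total[5m])) > 0
```

Because the right-hand side is empty both when a service's counter stops increasing and when its series goes stale, the expression should return the service in either case. A service that has been silent for longer than the outer window drops out of the result, so the outer window bounds how long the alert stays active.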

joe-elliott commented 1 week ago

I will also point out we've recently added "usage trackers" which will be in Tempo 2.7:

https://github.com/grafana/tempo/pull/4162

These will allow you to break down received bytes/second by any span or resource labels (namespace, cluster, etc.) and publish those metrics directly from the distributor (no metrics generator or Prometheus required).

rajushrajan commented 4 days ago

Hi @joe-elliott, thank you for your response. I will explore the usage trackers and get back to you.