Open EStork09 opened 1 year ago
I agree this is a good long-term goal. Are we only looking to alert on metrics dynamically calculated from traces, or simply on the existence of matching traces?
i.e.
{ status = error } | count() > 5
This doesn't create a metric, but it will match traces that have more than 5 errors in them. Do we want to alert on this, or on metrics only?
I would assume what you have in your example is what we would look to alert on, otherwise we could just alert on the metrics from the metric exporter. Having it be a traceql query would also allow for a quick look up to see those traces that are linked to the alert, for example.
So the ability to push alerts to Alertmanager using TraceQL queries would be pretty cool, and I'd support someone working on it now. We don't yet have the ability to alert on generic metrics over traces, but we could alert on whether or not a TraceQL query returned any hits.
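A minimal Python sketch of the "alert if the query returned any hits" idea from the comment above. It assumes a search response shaped like the output of Tempo's search API (a JSON object with a `traces` list) and builds a payload in the shape that Alertmanager's v2 `POST /api/v2/alerts` endpoint accepts. The function name, label set, and annotation keys are hypothetical, and the sketch makes no network calls.

```python
from datetime import datetime, timedelta, timezone


def build_alerts(rule_name, traceql, search_response, now=None,
                 hold=timedelta(minutes=5)):
    """Return a list of Alertmanager-style alerts; empty if the query had no hits.

    `search_response` is assumed to be the decoded JSON of a trace search,
    i.e. a dict with a "traces" list. All label/annotation names are
    illustrative, not an existing Tempo feature.
    """
    traces = search_response.get("traces") or []
    if not traces:
        return []
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": rule_name, "source": "tempo"},
        "annotations": {
            # Keeping the query lets a responder re-run it to find the traces.
            "query": traceql,
            "matched_traces": str(len(traces)),
            "example_trace_id": traces[0].get("traceID", ""),
        },
        "startsAt": now.isoformat(),
        # The alert expires unless a later evaluation re-sends it.
        "endsAt": (now + hold).isoformat(),
    }]
```

Carrying the TraceQL query in an annotation is one way to get the "quick lookup of the traces linked to the alert" mentioned earlier in the thread.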
This is not on anyone's roadmap at the moment and would require a fair amount of thought. If someone would like to pick it up, let us know.
I was looking at this earlier and was going to use the Loki or Mimir rulers as a starting point, but Tempo seems different enough from both that I can't simply lift and shift the logic over. That might just be due to my unfamiliarity with the code bases. I'd like to look into this, but it may be a little out of my depth, and I'm not sure of the best way to start.
I believe the current Loki and Mimir rulers write metrics, but a Tempo ruler could not quite do that yet. Do the Loki and Mimir rulers ever hit alertmanager? That's about the only thing we could add right now.
Maybe this idea just needs to be on hold until we add metrics from traces?
Ah yeah, that is a really good point. Loki appears to keep the ruler independent of Alertmanager: the ruler evaluates the rules (by running a query against the querier) and then fires off to Alertmanager when they trigger.
But yes, Loki rules use rate({query}[interval]) to convert logs to metrics for the ruler to evaluate, so Tempo would most likely need a rate syntax (along with sum, count, avg, etc. to combine with it to produce numeric values) in order to align with how Loki uses LogQL to create alerts. Do you want me to open a new issue to track generating metrics from TraceQL queries (i.e. adding the rate syntax and such)? Or is it already being tracked somewhere?
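To illustrate why a ruler needs a numeric value rather than a list of matches, here is a rough Python sketch of a rate-style check: count the matching events in a trailing window, divide by the window length, and compare the result to a threshold. This is not how Loki's ruler is implemented; the function names and the plain list of timestamps are assumptions made for the example.

```python
def rate(timestamps, now, window_seconds):
    """Per-second rate of events with timestamps in (now - window, now]."""
    start = now - window_seconds
    in_window = [t for t in timestamps if start < t <= now]
    return len(in_window) / window_seconds


def should_fire(timestamps, now, window_seconds, threshold):
    """True when the per-second rate over the window exceeds the threshold."""
    return rate(timestamps, now, window_seconds) > threshold
```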
Even if there is no issue, metrics from traces is definitely on our list. It may take a bit to get there, though.
I can give it a try.
Is there an update on this? Alerting based on TraceQL queries seems like a basic functionality for alerting based on trace data
Technically, you could build something to alert on Tempo/TraceQL metrics or selection queries right now. If the question is how to integrate Tempo with Prometheus Alertmanager, then your options are more limited.
Current: The only currently available way to use Alertmanager is to configure the Tempo Metrics Generator to remote write metrics to a Prometheus instance and set up your alerts normally.
Future: There are no current plans, but presumably in the future we would develop a "ruler" style component. This component would take TraceQL metrics queries, execute them as instant queries, and remote write the results to Prometheus. This would allow quite a bit more flexibility than the metrics generator approach.
Related thoughts: TraceQL metrics work right now but are quite difficult to tune for performance. Internally we are looking at some architecture changes to productionize RF1. This will allow everyone to enjoy performant TraceQL metrics as well as very nicely slash the TCO of OSS Tempo. This is our current primary focus as a team.
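The "ruler" style component described in the comment above does not exist yet, so the following Python sketch only illustrates the shape of one evaluation pass: run each rule's TraceQL metrics query as an instant query, turn the results into samples, and hand them to a remote-write step. Every name here (`RecordingRule`, `Sample`, `evaluate_once`, and the two injected callables) is hypothetical; the query and write steps are passed in as functions so the sketch makes no claims about Tempo's or Prometheus's actual client APIs.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class RecordingRule:
    record: str  # name of the metric to write
    query: str   # TraceQL metrics query to evaluate


@dataclass
class Sample:
    name: str
    labels: Dict[str, str]
    value: float
    timestamp_ms: int


# Hypothetical signatures for the injected steps:
#   run_instant_query(query, at) -> iterable of (labels, value) pairs
#   write_samples(samples)       -> sends the samples to a remote-write endpoint
InstantQuery = Callable[[str, float], Iterable[Tuple[Dict[str, str], float]]]
Writer = Callable[[List[Sample]], None]


def evaluate_once(rules: Iterable[RecordingRule],
                  run_instant_query: InstantQuery,
                  write_samples: Writer,
                  now: Optional[float] = None) -> List[Sample]:
    """Evaluate every rule once and forward the resulting samples."""
    now = time.time() if now is None else now
    ts_ms = int(now * 1000)
    samples: List[Sample] = []
    for rule in rules:
        for labels, value in run_instant_query(rule.query, now):
            samples.append(Sample(rule.record, dict(labels), value, ts_ms))
    if samples:  # skip the write when no rule produced a series
        write_samples(samples)
    return samples
```

A real component would call something like this on a fixed interval. Once the samples are in Prometheus, alerts could be defined there in the usual way.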
Thanks for the clarification. I just want to double-check whether we are talking about the same thing. It could be that I am missing some things, so I want to clarify:
I'm looking for Tempo support in Grafana Alerting, not necessarily native Prometheus / Alertmanager support. Grafana Alerting already supports many different data sources (91 according to the Grafana Plugin catalog) as of this writing.
I did not look into the details of what it takes to add alerting support for a data source, so I might be missing some things. From the outside it just looks a bit confusing that there are countless community plugins, e.g. for Elasticsearch, that work fine with Grafana Alerting, yet Grafana Tempo is not supported and support is not even on the roadmap as of now.
I know this topic is specifically about a Ruler, but other Grafana Alerting related issues were closed in favor of this issue as well. I just want to understand the current limitations and ensure we are talking about the same thing.
Also we are using Tempo Metrics Generator to generate metrics from traces right now for alerting, which works fine. Nevertheless a more direct support in Grafana Alerting would be nice.
Thanks again for your help and this great project!
I spoke with our internal team and there are no immediate plans to support Tempo in Grafana Alerting, but we all agree it will eventually make sense to add.
If you were to file a feature request in https://github.com/grafana/grafana/ it would help us track the demand for features like this.
Hi everyone. The TraceQL metrics feature is really good and useful for me. I configured Tempo to remote write to Prometheus, but I can't find the metrics I am looking for in order to create an alert. How can I see the trace/span durations in Prometheus? Thanks
Is your feature request related to a problem? Please describe.
It would be good to create alerts based upon traces that come into tempo.
Describe the solution you'd like
Add a ruler to tempo similar to how loki and mimir have them.
Describe alternatives you've considered
Metrics Generator is the current option but it doesn't allow things to trigger based upon specific attributes from traces.
Additional context
A similar issue, #2331, was created, but I figure a feature request to track the long-term desire would be good.