g3david405 opened 4 weeks ago
1000 spans/second is quite low and I'm surprised you're seeing performance issues. Query performance can be quite difficult to debug remotely, but we will do our best.
Can you provide an example of a TraceQL query that is having issues and the corresponding query-frontend logs? We can start there and try to figure out what is causing the issues.
Hi @joe-elliott, here is my TraceQL query. It searches backend API traces with a 5xx status, with a limit of 100 over a 12-hour time range.
{ resource.service.name =~ "$Job" && (span.url.path=~"$Url" || span.http.route=~"$Url" || span.http.target=~"$Url") && span.http.request.method=~"$Method" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~"$Namespace" || "$Namespace" = ".*") }
Here are my query-frontend logs:
level=info ts=2024-11-02T16:05:11.870021575Z caller=search_handlers.go:186 msg="search request" tenant=single-tenant query="{ resource.service.name =~ \"{MicroService Name}\" && (span.url.path=~\"\" || span.http.route=~\"\" || span.http.target=~\"\") && span.http.request.method=~\".+\" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~\".*\" || \".*\" = \".*\") }" range_seconds=43200 limit=100 spans_per_spanset=3
level=info ts=2024-11-02T16:05:45.5573293Z caller=reporter.go:257 msg="reporting cluster stats" date=2024-11-02T16:05:45.557324874Z
level=info ts=2024-11-02T16:05:45.904376455Z caller=poller.go:256 msg="successfully pulled tenant index" tenant=single-tenant createdAt=2024-11-02T16:00:37.074394709Z metas=547 compactedMetas=28
level=info ts=2024-11-02T16:05:45.904500382Z caller=poller.go:142 msg="blocklist poll complete" seconds=0.371037565
level=info ts=2024-11-02T16:06:38.069118075Z caller=search_handlers.go:167 msg="search results" tenant=single-tenant query="{ resource.service.name =~ \"{MicroService Name}\" && (span.url.path=~\"\" || span.http.route=~\"\" || span.http.target=~\"\") && span.http.request.method=~\".+\" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~\".*\" || \".*\" = \".*\") }" range_seconds=43200 duration_seconds=86.199097056 request_throughput=4.140089054159755e+07 total_requests=409 total_blockBytes=29200227564 total_blocks=31 completed_requests=409 inspected_bytes=3568719382 inspected_traces=0 inspected_spans=0 status_code=-1 error=null
In this example I only filter by the microservice name; if I also filter by an API path, the query takes even longer.
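For reference, a variant of the same query with the API path filled in would look roughly like this (the service name, path, and namespace values below are hypothetical placeholders, chosen only for illustration); every regex in it has to be evaluated against each candidate span:
{ resource.service.name =~ "order-service" && (span.url.path=~"/api/v1/orders.*" || span.http.route=~"/api/v1/orders.*" || span.http.target=~"/api/v1/orders.*") && span.http.request.method=~".+" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~"prod" || "prod" = ".*") }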
And here is what my Grafana frontend shows for the streaming result.
I can give more information if you need it, thanks!
Any update?
Still waiting for a reply, thanks.
Howdy, apologies for letting this sit. Been overwhelmed a bit by repo activity lately.
{ resource.service.name =~ "$Job" && (span.url.path=~"$Url" || span.http.route=~"$Url" || span.http.target=~"$Url") && span.http.request.method=~"$Method" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~"$Namespace" || "$Namespace" = ".*") }
This is a brutal query. The obvious issue is the number of regexes. The less obvious issue is that Tempo is really fast at evaluating a bunch of &&'ed conditions and way slower at mixed sets of && and || conditions.
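As a rough sketch of that point (the concrete values below are placeholders, and this is not an exact equivalent of the query above): replacing the regex matches with exact matches and dropping the always-true "$Namespace" = ".*" escape hatch leaves a flat list of &&'ed conditions, which is the shape Tempo evaluates fastest:
{ resource.service.name = "order-service" && span.url.path = "/api/v1/orders" && span.http.request.method = "GET" && span.http.response.status_code > 499 && resource.k8s.namespace.name = "prod" }
The ||'ed URL alternatives and the literal "$Namespace" = ".*" comparison are what turn the original into a mixed set of conditions.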
The good news: expect this query to improve in 2.7.
Thanks Joe, I'm really looking forward to the release of version 2.7!
I am using the latest version of tempo-distributed (v2.6.1), and my data volume is approximately 1,000 records per second, with a retention period of 21 days totaling around 900 GB. When performing TraceQL queries, I’m encountering significant performance bottlenecks, especially when querying span or resource attributes.
Following this article, https://grafana.com/docs/tempo/latest/operations/backend_search/, here are the improvements I've implemented so far:
However, despite these adjustments, the performance is still below acceptable levels. Are there any additional optimizations I could make?
My Helm chart values are as follows: