wangjinxiang0522 opened this issue 2 months ago
This controls the amount of time a querier will spend on one subjob of a trace-by-id lookup:
trace_by_id:
query_timeout: 60s
I would adjust the following on your query frontend:
server:
http_server_read_timeout: 60s
http_server_write_timeout: 60s
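Putting those two snippets together, a minimal sketch of where each setting lives (values are illustrative, not a recommendation; the trace_by_id timeout belongs in the querier config, the server timeouts on the query-frontend):
# Querier: bounds how long one trace-by-id subjob may run.
querier:
  trace_by_id:
    query_timeout: 60s
# Query-frontend: raise the HTTP server timeouts so the frontend does not
# cancel the request before the querier timeout above can be reached.
server:
  http_server_read_timeout: 60s
  http_server_write_timeout: 60s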
@joe-elliott, I think the issue is with the Tempo datasource: https://github.com/grafana/grafana/issues/92173
I'm seeing the same errors on my Grafana 11.0.0 (277ef258d4)/ Tempo 2.4.2 (2225623e7)
I'm running a multi-tenant setup and have two separate Grafana datasources against the same Tempo instance. One datasource works flawlessly; the other I'm able to query, but it stopped showing me the traces (TraceID queries) with the "499 context cancelled" error a couple of days ago.
I get a lot of warnings in the log of this form, which suggests to me some sort of corruption:
Aug 26 10:56:52 SERVER tempo[882]: level=warn ts=2024-08-26T08:56:52.554109607Z caller=server.go:2136
  traceID=3efc1d132d124dd9 msg="GET /querier/api/traces/fe7fee0da8a60b662d751c1631ede46e?blockEnd=ac687d6343eb1a1f0000000000000000&blockStart=a72f05397829cbc10000000000000000&mode=blocks (500)
  110.02302ms Response: \"error finding trace by id, blockID: a7f32a3a-39b5-4578-ae2c-29dc2f19cf8a: error retrieving bloom bloom-0 (TENANTNAME, a7f32a3a-39b5-4578-ae2c-29dc2f19cf8a): does not exist;
  error finding trace by id, blockID: a971b7d2-bf8c-48f2-b113-4298b29669c1: error retrieving bloom bloom-2 (TENANTNAME, a971b7d2-bf8c-48f2-b113-4298b29669c1): does not exist;
  error finding trace by id, blockID: a7ac3d7a-fc80-4fe6-8372-62f821fd338b: error retrieving bloom bloom-0 (TENANTNAME, a7ac3d7a-fc80-4fe6-8372-62f821fd338b): does not exist;
Anywho, if I specify a time range on TraceID queries in the datasource, the errors go away, so it may well be as @cancub says. On the other hand, it works on TENANT2 and only recently stopped working for TENANT, so ...
I'm having the same issue, but I could also reproduce it outside of Grafana by port-forwarding one of the Tempo pods and sending an API request:
$ curl -G -s http://localhost:3100/api/traces/679e1a61c8106b4d9ecc38c0013a38a
context canceled
Pod logs:
...
level=info ts=2024-08-25T13:04:40.397995757Z caller=tempodb.go:335 org_id=single-tenant msg="searching for trace in block" findTraceID=0679e1a61c8106b4d9ecc38c0013a38a block=b5db443a-e104-4020-9e6b-6e7aad1987d4 found=false
level=info ts=2024-08-25T13:04:40.398263033Z caller=tempodb.go:335 org_id=single-tenant msg="searching for trace in block" findTraceID=0679e1a61c8106b4d9ecc38c0013a38a block=b6881e63-b25a-46ec-a5bd-6f35eb82c8fc found=false
level=info ts=2024-08-25T13:04:40.409949046Z caller=handler.go:109 tenant=single-tenant method=GET traceID=5db3519ec2e3de40 url=/api/traces/679e1a61c8106b4d9ecc38c0013a38a duration=1.971874934s status=500 err="rpc error: code = Code(499) desc = context canceled" response_size=0
I tried to use tempo-cli to check whether the data is corrupted, but it looks fine:
go run . query trace-summary 679e1a61c8106b4d9ecc38c0013a38a single-tenant --backend="s3" --s3-endpoint=... --bucket=...
Number of blocks: 2
Span count: 14
Trace size: 7052 B
Trace duration: 4 seconds
Root service name: email
Root span info: ...
499 is an invalid status code and Tempo should never return it. I believe we have fixed this in 2.6. If someone is able to test, we would appreciate it.
@MikkelPorse You have corrupt/partial blocks in your backend.
@tomshabtay In your case I believe the trace-by-id request is simply timing out. How long before "context cancelled" is returned? Do you have any timeouts configured?
@joe-elliott Thanks for your reply. We are currently running version 2.4; I will let you know if updating to 2.6 fixes it. The context cancelled is returned in 1.97 seconds in the example I shared.
The querier configuration is:
querier:
frontend_worker:
frontend_address: tempo-headless.kube-system.svc.cluster.local:9095
trace_by_id:
query_timeout: 60s
search:
query_timeout: 2m
external_hedge_requests_at: 500ms
external_hedge_requests_up_to: 4
What is your query frontend config?
query_frontend:
  trace_by_id:
    query_shards: 100
  max_retries: 10
  search:
    max_duration: 0
    query_backend_after: 20ms
    query_ingesters_until: 20ms
    target_bytes_per_job: 419430400
@joe-elliott, I think the issue is with the Tempo datasource: grafana/grafana#92173
In my case it was actually a bad configuration. We had disabled traceQuery in the Tempo datasource in Grafana 10.0.3 due to the time shift error in the datasource. This was resolved, but we forgot to re-add traceQuery.
query_frontend:
max_retries: 10
search:
query_backend_after: 20ms
query_ingesters_until: 20ms
I don't know if any of these would cause your issue, but they are all concerning to me. 10 is a lot of retries. We use 2 internally.
The two query_... duration parameters should be left alone. They need to be set based on your polling cycle and complete block timeout to work correctly.
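Applied to the config quoted above, a hedged sketch of that advice could look like the following (treat the values as placeholders; "blocklist poll cycle" and "complete block timeout" are my reading of the polling and complete-block settings mentioned, so check them against your own deployment):
query_frontend:
  max_retries: 2              # "10 is a lot of retries. We use 2 internally."
  trace_by_id:
    query_shards: 100
  search:
    # Drop the 20ms overrides and leave query_backend_after /
    # query_ingesters_until at their defaults, or size them from your
    # blocklist poll cycle and complete block timeout as noted above.
    target_bytes_per_job: 419430400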
Does anyone know if Tempo version 2.6 fixes this bug? On my side, this bug only occurs on one Tempo datasource.
In 2.6 this returns a 500, which is better but still not great.
In 2.7 we will add a new traces v2 endpoint that returns a partial trace along with a warning message when a trace is too large. We also have a PR up to return a 4xx instead of a 5xx when the trace is too large on the /api/traces endpoint.
Describe the question
When I request a large trace, I get a query error.
I set the query_timeout to 60 seconds, but it doesn't seem to take effect. The request fails and returns an error in less than 10 seconds.
Config:
What parameter needs to be changed for the trace to load?
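Pulling the replies above together, the parameters suggested in this thread are roughly the following; this is only a sketch with illustrative values, and per the maintainer comments a sufficiently large trace may still fail on /api/traces until the 2.7 changes land:
server:
  http_server_read_timeout: 60s    # query-frontend HTTP timeouts, suggested early in the thread
  http_server_write_timeout: 60s
querier:
  trace_by_id:
    query_timeout: 60s             # per-subjob trace-by-id timeout on the querier
query_frontend:
  max_retries: 2                   # keep retries low, per the maintainer advice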