thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

add `query-range.timeout` to query-frontend #7046

Open m-yosefpor opened 10 months ago

m-yosefpor commented 10 months ago

Is your proposal related to a problem?

The Thanos query frontend is hard to bound in time when range queries are sharded horizontally with the `query-range.split-interval` parameter. The `query.timeout` flag of the `query` subcommand applies to each split interval separately rather than enforcing a global timeout for the entire range query. As a result, the query frontend can keep processing individual chunks for a long time, which lengthens query processing and can lead to resource exhaustion, degraded system performance, and a poor user experience. Thanos currently offers slow query logging (`query-frontend.log-queries-longer-than`), which logs queries that run longer than a predefined threshold but cannot abort them. This gives visibility into long-running queries, but it does not prevent them from consuming resources and affecting system performance.

Describe the solution you'd like

To address this, I propose a new flag, `query-range.timeout`, for the query frontend. It would set a global timeout for range queries, so that the frontend aborts any request exceeding that duration. With `query-range.timeout` set, a range query can no longer run indefinitely even when every individual split interval completes within `query.timeout`. The timeout would apply to the same total duration that `query-frontend.log-queries-longer-than` already measures and logs.

Vanshikav123 commented 10 months ago

What default value would you like for the `query-range.timeout` flag?

m-yosefpor commented 10 months ago

> What default value would you like for the `query-range.timeout` flag?

Since `query.timeout` defaults to 2m, `query-range.timeout` probably needs a larger default. 5m might be a good default, and people can then tune the flag for their use cases. (Our use case needs a much lower timeout, e.g. 30s; we have already set the `query.timeout` flag in the querier to 10s.)

yeya24 commented 10 months ago

You can do this by putting a gateway such as Envoy or Ambassador in front of Query Frontend. Enforce the timeout there: when the client times out, Query Frontend cancels the query context (context canceled).

kartikaysaxena commented 10 months ago

Can I give this a try?

Vanshikav123 commented 10 months ago

> Can I give this a try?

Hello @kartikaysaxena, I am currently working on this and have raised a PR too. I will pass it on to you if it doesn't work.