thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0
13.07k stars 2.09k forks

Thanos Querier aggregate functions (count, sum, min, max) on metrics do not honour the --query.timeout configuration setting #7846

Open mohaabduvisa opened 2 hours ago

mohaabduvisa commented 2 hours ago

What happened: Queries in Thanos Querier that use aggregate functions (count, sum, min, max) do not honour the --query.timeout configuration setting. We pass - --query.timeout=30s as an argument in the Thanos Querier deployment so that Prometheus/Thanos/sidecar are not OOM-killed by long-running or expensive queries. A raw query honours the timeout and is terminated once it runs past 30 seconds, whereas an aggregate query is not terminated after the configured timeout.

What you expected to happen: Thanos Querier terminates any long-running query that exceeds 30 seconds, whether it is a raw or an aggregate query.

How to reproduce it (as minimally and precisely as possible): Run an aggregate query such as count({__name__!=""}) by (__name__) and check whether the query timeout is applied to the long-running query.

FYI: a total of 3M samples are available in Prometheus.
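For anyone who wants to reproduce this without the UI, here is a minimal sketch that times an instant query against the Prometheus-compatible /api/v1/query endpoint. The helper names (build_query_url, timed_query) and the base URL in the usage note are hypothetical, and the code assumes the Querier's HTTP API is reachable without authentication.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request


def build_query_url(base_url, promql, timeout=None):
    """Build an instant-query URL for a Prometheus-compatible /api/v1/query endpoint."""
    params = {"query": promql}
    if timeout is not None:
        # Optional per-request evaluation timeout accepted by the query API.
        params["timeout"] = timeout
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode(params)


def timed_query(base_url, promql, client_timeout_s=120):
    """Run one instant query and return (elapsed seconds, HTTP status, API status)."""
    url = build_query_url(base_url, promql)
    start = time.monotonic()
    try:
        # The client-side timeout is deliberately longer than the 30s server setting,
        # so that any cut-off observed comes from the server and not from this script.
        with urllib.request.urlopen(url, timeout=client_timeout_s) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        status, body = err.code, err.read()
    elapsed = time.monotonic() - start
    try:
        payload = json.loads(body)
        api_status = payload.get("status") if isinstance(payload, dict) else None
    except ValueError:
        api_status = None  # body was not JSON
    return elapsed, status, api_status
```

Calling timed_query("http://localhost:10902", "<query>") once with a raw selector and once with the aggregate query (the address is an assumption about the local setup) gives two elapsed times to compare. If --query.timeout=30s is enforced, neither call should run much longer than 30 seconds before returning an error status.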

1) The screenshot below shows the working scenario, where the query is terminated:

timeout_wrkng

2) The screenshot below shows the failing scenario, where the query is not terminated:

timeout_not_wrkng

Environment: Thanos 0.35.1, Prometheus 2.52.0, Go 1.22.5

dosubot[bot] commented 2 hours ago

Hey @mohaabduvisa! 👋 I'm here to help you with your tech issues and questions as you wait for a human maintainer.

The issue you're experiencing with Thanos Querier not honoring the --query.timeout setting for aggregate functions like count, sum, min, and max could be related to how the timeout is applied in the query processing pipeline. The --query.timeout flag is implemented to set a maximum time for processing a query by configuring the Timeout field in the promql.EngineOpts struct [1].
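As a rough illustration of how an engine-level timeout generally takes effect, here is a toy model in Python. It is not Thanos or Prometheus code, and the names evaluate and QueryTimeout are made up for this sketch. It shows one general property of deadline-based timeouts: the deadline only stops work at the points where the code checks it, so a step that never reaches a check can run past the limit. Whether that explains the behaviour reported here is unverified.

```python
import time


class QueryTimeout(Exception):
    """Raised when evaluation passes its deadline (hypothetical name)."""


def evaluate(steps, timeout_s, clock=time.monotonic):
    """Run steps in order, checking the deadline only between steps."""
    deadline = clock() + timeout_s
    results = []
    for step in steps:
        # The only enforcement point: a step that is already running is never
        # interrupted, however long it takes.
        if clock() >= deadline:
            raise QueryTimeout(f"query exceeded {timeout_s}s")
        results.append(step())
    return results
```

In this model, a query made of many short steps is cut off close to the deadline, while a query whose work sits inside one long step overruns until that step returns.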

However, there might be differences in how timeouts are enforced for raw versus aggregate queries, especially in distributed execution models. The distributed execution model can affect timeout enforcement by breaking down queries into fragments and delegating them to different components, which might lead to inconsistencies in timeout behavior for complex queries like aggregates [2][3].

To address this, you might want to check if there are any specific configurations or settings that could affect the behavior of aggregate functions with respect to timeouts. Additionally, consider reviewing the distributed execution settings and ensure that the timeout configurations are consistently applied across all components involved in query processing. If the issue persists, it might require further investigation into the specific implementation details or potential limitations in the current version of Thanos you are using.
