thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Query Frontend: unable to run queries in Graph Format #7799

Open RomyKess opened 1 week ago

RomyKess commented 1 week ago

Thanos, Prometheus and Golang version used:

Thanos version: 0.35.0
Go version: go1.21.9
Prometheus version: 2.39.1

Object Storage Provider: Dell ECS

What happened: Query Frontend fails to return results for some metrics in graph format, regardless of resolution, even though the same queries succeed through Querier and Grafana.

What you expected to happen: The metric to return valid results in graph view through Query Frontend, just as it does through Querier.

How to reproduce it (as minimally and precisely as possible): Run a heavy metric such as container_oom_events_total in graph view on both Querier and Query Frontend. The downstream tripper config is response_header_timeout: 2m and max_idle_conns_per_host: 100, and the downstream is a Querier load balancer.
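For context, a minimal sketch of how such a downstream tripper config is typically wired into Query Frontend is shown below; the file name, addresses, and ports are assumed placeholders, not values taken from this report:

```
# Sketch only: file name, addresses, and ports are assumed placeholders.
cat > downstream-tripper.yaml <<'EOF'
response_header_timeout: 2m
max_idle_conns_per_host: 100
EOF

thanos query-frontend \
  --http-address=0.0.0.0:9090 \
  --query-frontend.downstream-url=http://querier-lb:10902 \
  --query-frontend.downstream-tripper-config-file=downstream-tripper.yaml
```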

Full logs to relevant components: Error executing query: Unexpected token '<', " <bod"... is not valid JSON

Anything else we need to know: sum(container_oom_events_total) and sum(container_oom_events_total) by (pod) both work in Query Frontend; only the raw metric on its own fails. Not all metrics are affected: kube_cronjob_info, for example, works fine in graph format. Re-running the affected query a few times does not help; the error persists. Table format works for container_oom_events_total; only graph format fails.
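The "Unexpected token '<'" error usually means the UI received an HTML body where it expected JSON. Since the table view issues an instant query (/api/v1/query) while the graph view issues a range query (/api/v1/query_range), the failure appears isolated to the range-query path that Query Frontend splits and caches. A rough sketch for capturing the raw response body on both paths, assuming placeholder host names and ports:

```
# Sketch for comparing the two paths; host names and ports are assumptions.
# Requires GNU date for the relative timestamp.
START=$(date -d '1 hour ago' +%s); END=$(date +%s)

# Range query (what the graph view uses) through Query Frontend: capture the raw body.
curl -sS -o /tmp/qfe-range.txt -w 'HTTP %{http_code}\n' \
  "http://query-frontend:9090/api/v1/query_range?query=container_oom_events_total&start=${START}&end=${END}&step=60"
head -c 300 /tmp/qfe-range.txt   # an HTML body here explains the "Unexpected token '<'" error

# Same range query directly against the Querier for comparison.
curl -sS -o /tmp/querier-range.txt -w 'HTTP %{http_code}\n' \
  "http://querier:10902/api/v1/query_range?query=container_oom_events_total&start=${START}&end=${END}&step=60"
head -c 300 /tmp/querier-range.txt
```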

Environment:

dosubot[bot] commented 1 week ago

Hey @RomyKess! 👋 I'm here to help you with any bugs, questions, or contributions while you wait for a human maintainer. Let's squash some bugs together!

I found a similar closed issue that might be relevant to your problem: Error executing query: "Unexpected token '<', '<h'..." is not valid JSON. This issue was resolved by a pull request that handled query parsing errors on the Query Frontend [1].

Additionally, for handling heavy metrics like container_oom_events_total, consider the following best practices:

  1. Query Splitting: Use the --query-range.split-interval flag to split long queries into shorter ones, which can help prevent OOM issues and improve load balancing.
  2. Caching: Enable caching to store and reuse query results, reducing backend load.
  3. Retry Mechanism: Configure retries for failed queries using the --query-range.max-retries-per-request flag.
  4. Horizontal Scalability: Deploy multiple instances of the Query Frontend to distribute the load.
  5. Downstream Tripper Configuration: Ensure max_idle_conns_per_host is set to at least 100 to leverage HTTP keep-alive connections, reducing latency.
  6. Slow Query Logging: Use the --query-frontend.log-queries-longer-than flag to log and optimize slow queries.
  7. Compression: Enable response compression to reduce data transfer over the network [2][3].

These practices can help optimize the performance of the Query Frontend under high-load conditions; a sketch combining several of these flags is shown below.
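As a non-authoritative sketch, several of the flags above might be combined in a single Query Frontend invocation as follows; the addresses, cache config file, and values are illustrative assumptions, not recommendations from this thread:

```
# Illustrative only: addresses, file names, and values are assumed placeholders.
thanos query-frontend \
  --http-address=0.0.0.0:9090 \
  --query-frontend.downstream-url=http://querier-lb:10902 \
  --query-range.split-interval=24h \
  --query-range.response-cache-config-file=cache.yaml \
  --query-range.max-retries-per-request=5 \
  --query-frontend.log-queries-longer-than=10s \
  --query-frontend.compress-responses
```

Horizontal scaling (point 4) is handled at the deployment level rather than through flags, and the downstream tripper settings (point 5) can be supplied via a tripper config file as sketched earlier in the thread.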

To continue talking to Dosu, mention @dosu.