Open RomyKess opened 1 week ago
Hey @RomyKess! 👋 I'm here to help you with any bugs, questions, or contributions while you wait for a human maintainer. Let's squash some bugs together!
I found a similar closed issue that might be relevant to your problem: Error executing query: "Unexpected token '<',' <h'.. is not a valid json. This issue was resolved by a pull request that handled query parsing errors on the Query Frontend [1].
Additionally, for handling heavy metrics like container_oom_events_total, consider the following best practices:
- Use the --query-range.split-interval flag to split long queries into shorter ones, which can help prevent OOM issues and improve load balancing.
- Configure retries with the --query-range.max-retries-per-request flag.
- Ensure max_idle_conns_per_host is set to at least 100 to leverage HTTP keep-alive connections, reducing latency.
- Use the --query-frontend.log-queries-longer-than flag to log slow queries so they can be optimized.

These practices can help optimize the performance of the Query Frontend under high-load conditions.
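To make the idea behind --query-range.split-interval concrete, here is a minimal Python sketch of cutting one long time range into shorter consecutive sub-ranges. It is only an illustration: `split_range` is a hypothetical helper, not Thanos code, and it ignores any boundary alignment or step handling the real Query Frontend may apply.

```python
from datetime import datetime, timedelta, timezone


def split_range(start, end, interval):
    """Split [start, end) into consecutive sub-ranges of at most `interval`.

    Hypothetical helper for illustration only; not how Thanos implements it.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    chunks = []
    cursor = start
    while cursor < end:
        # The last chunk is clipped so it never runs past `end`.
        chunk_end = min(cursor + interval, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


# Example: a 60-hour range with a 24-hour interval gives three sub-ranges
# (24h, 24h, 12h), each of which could be sent downstream on its own.
example_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
example_chunks = split_range(
    example_start, example_start + timedelta(hours=60), timedelta(hours=24)
)
```

Each sub-range returns a smaller response than the full range would, which is the reason splitting can help with very high-cardinality metrics.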
Thanos, Prometheus and Golang version used:
Thanos version 0.35.0 Go version go1.21.9 Prometheus version 2.39.1
Object Storage Provider: Dell ECS
What happened: Query Frontend is unable to run some metrics in graph format, no matter the resolution, even though Querier and Grafana can.
What you expected to happen: Metric in graph view to return valid results just like Querier does.
How to reproduce it (as minimally and precisely as possible): Run a heavy metric such as container_oom_events_total in graph view on both Querier and Query Frontend. The downstream tripper config sets response_header_timeout: 2m and max_idle_conns_per_host: 100. The downstream URL is the Querier load balancer.
Full logs to relevant components: Error executing query: Unexpected token '<', " <bod"... is not valid JSON
Anything else we need to know: sum(container_oom_events_total) and sum(container_oom_events_total) by (pod) both work in Query Frontend. Only the bare metric fails. Not all metrics have this issue. For example, kube_cronjob_info works just fine in graph format. Running the query a few times in a row doesn't help; the error persists. The table format works for container_oom_events_total; only the graph fails.
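One way to narrow this down might be to call the range-query endpoint directly and inspect what comes back instead of JSON. The error text suggests the body starts with HTML (" <bod"), which could be an error page from something in front of the Querier, but that is an assumption. The sketch below uses only the Python standard library and the Prometheus-compatible /api/v1/query_range endpoint; the helper names and the base URL in the usage comment are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def build_range_url(base_url, query, start, end, step):
    """Build a Prometheus-compatible /api/v1/query_range URL."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def classify_body(body):
    """Return 'json', 'html', or 'other' for a response body given as bytes."""
    text = body.decode("utf-8", errors="replace").lstrip()
    if text.startswith("<"):
        return "html"
    try:
        json.loads(text)
    except ValueError:
        return "other"
    return "json"


def probe(url, timeout=120):
    """Fetch url and return (status, content_type, kind, first 200 bytes)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, headers, body = resp.status, resp.headers, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a body worth inspecting.
        status, headers, body = err.code, err.headers, err.read()
    return status, headers.get("Content-Type"), classify_body(body), body[:200]


# Usage (hypothetical address; replace with your Query Frontend and timestamps):
# url = build_range_url("http://query-frontend.example:10902",
#                       "container_oom_events_total", 1700000000, 1700003600, 15)
# print(probe(url))
```

If the result is 'html' together with a 4xx or 5xx status, the status code and the first bytes of the body should show which component produced the page. Running the same probe against the Querier directly would give a baseline for comparison.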
Environment: