mkrull opened 1 month ago
This comment probably refers to the same issue: https://github.com/thanos-io/thanos/issues/6744#issuecomment-1789575167
I think it's a consequence of https://github.com/thanos-io/thanos/pull/6706. We had to fix a correctness bug, and as a consequence responses need to be sorted in memory before being sent off. Unfortunately, Prometheus sometimes produces an unsorted response, and that needs to be fixed upstream, or the external labels functionality has to be completely reworked. See https://github.com/prometheus/prometheus/issues/12605
Ouch, I see. Upgrading in environments like Kubernetes now comes with a considerable new risk of OOMs for pods running Prometheus with a Thanos sidecar, because it becomes really hard to estimate the maximum memory requirements for the sidecar containers 🤔
Thanos, Prometheus and Golang version used:
Object Storage Provider:
AWS S3
What happened:
After upgrading from 0.31.0 to 0.35.0 we saw greatly increased sidecar memory usage and narrowed it down to a change between 0.32.2 and 0.32.3 (possibly the Prometheus update?).
Memory usage shoots up for certain queries (in our case, most likely the recording rules evaluated by the ruler), so constantly high usage was observed.
What you expected to happen:
No significant change in memory usage.
How to reproduce it (as minimally and precisely as possible):
Run the query
{job=~".+"}
on Prometheus with some metrics for either version and compare memory usage.
Full logs to relevant components:
Anything else we need to know:
Heap profiles for 0.32.2 and 0.32.3 with the same query on the same Prometheus node: