thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0
12.73k stars 2.04k forks

sidecar: Greatly increased Thanos sidecar memory usage from 0.32.2 to 0.32.3, still exists in 0.35.0 #7395

Open mkrull opened 1 month ago

mkrull commented 1 month ago

Thanos, Prometheus and Golang version used:

thanos, version 0.32.3 (branch: HEAD, revision: 3d98d7ce7a254b893e4c8ee8122f7f6edd3174bd)
  build user:       root@0b3c549e9dae
  build date:       20230920-07:27:32
  go version:       go1.20.8
  platform:         linux/amd64
  tags:             netgo

Object Storage Provider:

AWS S3

What happened:

After upgrading from 0.31.0 to 0.35.0 we saw greatly increased sidecar memory usage and narrowed it down to a change between 0.32.2 and 0.32.3 (the Prometheus update maybe?).

Memory usage spikes for certain queries. In our case these are most likely recording rules evaluated by the ruler, which is why we observed constantly high usage.

What you expected to happen:

No significant change in memory usage.

How to reproduce it (as minimally and precisely as possible):

Run {job=~".+"} on a Prometheus with some metrics, once per sidecar version, and compare the sidecar memory usage.

Full logs to relevant components:

Anything else we need to know:

Heap profiles for 0.32.2 and 0.32.3 with the same query on the same Prometheus node:

[Image: heap profile, Thanos 0.32.2]

[Image: heap profile, Thanos 0.32.3]

mkrull commented 1 month ago

This comment probably refers to the same issue: https://github.com/thanos-io/thanos/issues/6744#issuecomment-1789575167

GiedriusS commented 1 month ago

I think it's a consequence of https://github.com/thanos-io/thanos/pull/6706. We had to fix a correctness bug, and as a result responses need to be sorted in memory before being sent off. Unfortunately, Prometheus sometimes produces an unsorted response, and that needs to be fixed upstream. Otherwise, the external labels functionality has to be completely reworked. See https://github.com/prometheus/prometheus/issues/12605
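
To illustrate why the whole response ends up buffered, here is a rough sketch of one way that adding an external label can break the order of an already sorted series set. It is not the actual Thanos or Prometheus code. All type and function names are hypothetical, and the comparison (name, then value, pair by pair, then length) is an assumption about how label sets are ordered:

```go
package main

import (
	"fmt"
	"sort"
)

// Label and Series are simplified stand-ins, not the real Prometheus types.
type Label struct{ Name, Value string }
type Series []Label // labels kept sorted by name

// compare orders label sets pair by pair (name first, then value);
// if one set is a prefix of the other, the shorter one sorts first.
func compare(a, b Series) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i].Name != b[i].Name {
			if a[i].Name < b[i].Name {
				return -1
			}
			return 1
		}
		if a[i].Value != b[i].Value {
			if a[i].Value < b[i].Value {
				return -1
			}
			return 1
		}
	}
	return len(a) - len(b)
}

// withExternal returns a copy of s with ext added, labels re-sorted by name.
func withExternal(s Series, ext Label) Series {
	out := append(Series{}, s...)
	out = append(out, ext)
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// addExternalToAll applies withExternal to every series, keeping input order.
func addExternalToAll(set []Series, ext Label) []Series {
	out := make([]Series, 0, len(set))
	for _, s := range set {
		out = append(out, withExternal(s, ext))
	}
	return out
}

func isSorted(set []Series) bool {
	return sort.SliceIsSorted(set, func(i, j int) bool { return compare(set[i], set[j]) < 0 })
}

// sortSeries needs the complete set in memory, which is the costly part.
func sortSeries(set []Series) {
	sort.Slice(set, func(i, j int) bool { return compare(set[i], set[j]) < 0 })
}

func main() {
	// Sorted as returned: {job="api"} is a prefix of {job="api", pod="p1"}.
	fromProm := []Series{
		{{"job", "api"}},
		{{"job", "api"}, {"pod", "p1"}},
	}
	out := addExternalToAll(fromProm, Label{"region", "eu"})
	fmt.Println("sorted after adding external label:", isSorted(out))
	sortSeries(out) // every series must be held before the first can be sent
	fmt.Println("sorted after in-memory re-sort:", isSorted(out))
}
```

In this sketch the external label region sorts after pod, so the series that used to come first now belongs second. Restoring the order requires holding every series of the response at once instead of streaming them, which would explain memory that grows with the size of the result.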

mkrull commented 1 month ago

Ouch, I see. Upgrading in environments like Kubernetes comes with a considerable new risk of OOMs for pods running Prometheus with the Thanos sidecar, because it becomes really hard to estimate the maximum memory requirements for the sidecar containers 🤔