anas-aso opened this issue 1 year ago
A similar issue was reported in another ticket for the Receive component: https://github.com/thanos-io/thanos/issues/6176#issuecomment-1491704718.
Does removing the `--store.grpc.series-sample-limit=50000000` flag eliminate the spike?
@fpetkovski I just tried dropping that limit, but the memory spike still happens.
@fpetkovski any other ideas to try here would be appreciated.
Unfortunately I am not aware of any other changes that could be contributing to the memory spike.
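For reference, this is what the test above amounts to: a minimal sketch of a store-gateway invocation with the sample limit dropped. The data dir and objstore paths here are placeholders, not the actual deployment.

```sh
# Minimal sketch (placeholder paths, not the actual deployment):
# the same store-gateway invocation, with the sample limit no longer passed.
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/objstore.yaml
  # dropped for the test: --store.grpc.series-sample-limit=50000000
```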
Hi.
I've observed a change in behavior between v0.30.2 and v0.31.0 regarding the type of memory used.
Both versions use roughly the same total amount of memory, but v0.30.2 and earlier account for it mostly as RssFile (file-backed page cache?), while v0.31.0 accounts for it mostly as RssAnon. In my Kubernetes setup this change triggers OOMKills, since RssAnon counts against the container memory limit.
For v<=0.30.2:
/ # cat /proc/1/status
[...]
VmPeak: 23497568 kB
VmSize: 23497568 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 22818140 kB
VmRSS: 22818140 kB
RssAnon: 1500400 kB
RssFile: 21317740 kB
RssShmem: 0 kB
VmData: 1557020 kB
VmStk: 140 kB
VmExe: 24052 kB
VmLib: 8 kB
VmPTE: 44788 kB
VmSwap: 0 kB
For v0.31.0:
/ # cat /proc/1/status
[...]
VmPeak: 30499504 kB
VmSize: 30499504 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 26583296 kB
VmRSS: 26568004 kB
RssAnon: 24831888 kB
RssFile: 1736116 kB
RssShmem: 0 kB
VmData: 26235004 kB
VmStk: 140 kB
VmExe: 27896 kB
VmLib: 8 kB
VmPTE: 53868 kB
VmSwap: 0 kB
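One way to watch this difference over time is to sample /proc/1/status from inside the pod. A rough sketch, assuming a hypothetical namespace and pod name (adjust for your own deployment):

```sh
# Rough sketch: sample anonymous vs file-backed RSS of the container's
# main process (PID 1) every 30 seconds. "monitoring" and "thanos-store-0"
# are hypothetical names, not from this report.
while true; do
  date -u +%FT%TZ
  kubectl -n monitoring exec thanos-store-0 -- \
    grep -E '^(VmRSS|RssAnon|RssFile):' /proc/1/status
  sleep 30
done
```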
This PR could have fixed the issue: https://github.com/thanos-io/thanos/pull/6509
Upgraded a system from 0.28.0 to 0.32.0-rc.0 and this is still an issue:
Thanos, Prometheus and Golang version used:
Thanos: goversion="go1.19.7", revision="50c464132c265eef64254a9fd063b1e2419e09b7", version="0.31.0"
Prometheus: goversion="go1.19.2", revision="dcd6af9e0d56165c6f5c64ebbc1fae798d24933a", version="2.39.1"
Object Storage Provider: GCP Storage and AWS S3
What happened: Memory usage spiked during startup after upgrading from 0.28.0 to 0.31.0. After the spike I downgraded and then upgraded again gradually from 0.28.0; the startup memory spike appears only when going from 0.30.2 to 0.31.0, so the changes in 0.31.0 are the culprit.
What you expected to happen: Memory usage stays roughly the same.
How to reproduce it (as minimally and precisely as possible): We run Thanos on both GCP and AWS and I noticed the issue on both cloud providers.
POD args:
This store exposes metrics that are older than ~30 days. Our retention is 2 years (the 2-years-minus-30-days range is very rarely queried, which is why we delegate it to a single instance).
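The pod args themselves are not included above. Purely as an illustration (not the reporter's actual flags), a store gateway dedicated to the "older than ~30 days" slice of a 2-year retention window is typically time-partitioned with the store's --min-time/--max-time flags, roughly like this:

```sh
# Hypothetical, illustrative args only: serve blocks between ~2 years ago
# and ~30 days ago, leaving the most recent 30 days to other instances.
thanos store \
  --objstore.config-file=/etc/thanos/objstore.yaml \
  --data-dir=/var/thanos/store \
  --min-time=-2y \
  --max-time=-30d
```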
Full logs to relevant components: There is nothing special in the logs, just a huge list of events like the one below:
```json
{
  "@timestamp": "2023-03-31T10:15:18.234182290Z",
  "caller": "bucket.go:654",
  "elapsed": "5.849035528s",
  "id": "01FNAN7EDKBJ9762ZVSV0VDCSH",
  "level": "info",
  "msg": "loaded new block"
}
```
Anything else we need to know: