thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io

Store: High memory usage on startup after upgrading to 0.31.0 #6251

Open · anas-aso opened 1 year ago

anas-aso commented 1 year ago

Thanos, Prometheus and Golang version used:
Thanos: version="0.31.0", goversion="go1.19.7", revision="50c464132c265eef64254a9fd063b1e2419e09b7"
Prometheus: version="2.39.1", goversion="go1.19.2", revision="dcd6af9e0d56165c6f5c64ebbc1fae798d24933a"

Object Storage Provider: GCP Storage and AWS S3

What happened: Memory usage spiked during startup after upgrading from 0.28.0 to 0.31.0. After the spike I downgraded and then upgraded gradually from 0.28.0 again, and the startup spike only appears when going from 0.30.2 to 0.31.0, so the changes in 0.31.0 are the culprit.

(screenshot: memory usage graph, 2023-03-31 at 12:26)

What you expected to happen: Memory usage stays roughly the same.

How to reproduce it (as minimally and precisely as possible): We run Thanos on both GCP and AWS and I noticed the issue on both cloud providers.

Pod args

```yaml
spec:
  containers:
  - args:
    - store
    - --log.format=json
    - --data-dir=/var/thanos/store
    - --objstore.config-file=/thanos_config.yaml
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:19191
    - --consistency-delay=10m
    - --ignore-deletion-marks-delay=0s
    - --max-time=-719h
    - --store.grpc.series-max-concurrency=5
    - --store.grpc.series-sample-limit=50000000
    - --store.enable-index-header-lazy-reader
    image: thanosio/thanos:v0.31.0
```

This store serves metrics older than ~30 days. Our retention is 2 years, and the 30 days to 2 years range is queried very rarely, which is why we delegate it to a single instance.
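For illustration, a minimal sketch of how this time-based split between a "recent" and an "archive" Store Gateway can be expressed with the standard `--min-time`/`--max-time` flags; the values below are placeholders, not our exact production config:

```yaml
# Sketch only: two Store Gateway deployments partitioned by block time range.
# --min-time / --max-time are regular Thanos Store flags; the values are examples.
- args:                        # "recent" store, queried frequently
  - store
  - --objstore.config-file=/thanos_config.yaml
  - --min-time=-720h           # serve only blocks newer than ~30 days
- args:                        # "archive" store (the one in this issue), queried rarely
  - store
  - --objstore.config-file=/thanos_config.yaml
  - --max-time=-719h           # serve only blocks older than ~30 days
```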

Full logs to relevant components: There is nothing special in the logs, just a huge list of events like the one below:

Logs

```json
{
  "@timestamp": "2023-03-31T10:15:18.234182290Z",
  "caller": "bucket.go:654",
  "elapsed": "5.849035528s",
  "id": "01FNAN7EDKBJ9762ZVSV0VDCSH",
  "level": "info",
  "msg": "loaded new block"
}
```

Anything else we need to know:

fpetkovski commented 1 year ago

A similar issue was reported in another ticket for the Receive component: https://github.com/thanos-io/thanos/issues/6176#issuecomment-1491704718.

Does removing the --store.grpc.series-sample-limit=50000000 flag eliminate the spike?

anas-aso commented 1 year ago

@fpetkovski I just tried dropping that limit, but the memory spike still happens.

anas-aso commented 1 year ago

@fpetkovski any other ideas to try regarding this would be appreciated.

fpetkovski commented 1 year ago

Unfortunately I am not aware of any other changes that could be contributing to the memory spike.

demikl commented 1 year ago

Hi.

I've observed a change in behavior between v0.30.2 and v0.31.0 regarding the type of memory used.

Both versions use roughly the same amount of memory, but v0.30.2 and earlier use mostly RssFile (file-backed page cache), while v0.31.0 uses mostly RssAnon. In my Kubernetes setup this change triggers OOMKills, since RssAnon is counted against the container memory limit.

For v0.30.2 and earlier:

```
/ # cat /proc/1/status
[...]
VmPeak: 23497568 kB
VmSize: 23497568 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:  22818140 kB
VmRSS:  22818140 kB
RssAnon:     1500400 kB
RssFile:    21317740 kB
RssShmem:          0 kB
VmData:  1557020 kB
VmStk:       140 kB
VmExe:     24052 kB
VmLib:         8 kB
VmPTE:     44788 kB
VmSwap:        0 kB
```

For v0.31.0:

```
/ # cat /proc/1/status
[...]
VmPeak: 30499504 kB
VmSize: 30499504 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:  26583296 kB
VmRSS:  26568004 kB
RssAnon:    24831888 kB
RssFile:     1736116 kB
RssShmem:          0 kB
VmData: 26235004 kB
VmStk:       140 kB
VmExe:     27896 kB
VmLib:         8 kB
VmPTE:     53868 kB
VmSwap:        0 kB
```

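For context, a hedged sketch of the kind of Kubernetes resources stanza where this bites; the numbers are illustrative, not my actual settings. Anonymous memory cannot be reclaimed the way page cache can, so with v0.31.0 the same total RSS now counts against the limit and the container gets OOMKilled:

```yaml
# Illustrative only: store container resources; the values are examples, not real config.
resources:
  requests:
    memory: 24Gi
  limits:
    memory: 26Gi   # on v0.31.0, RssAnon alone (~24 GiB above) approaches this limit;
                   # on v0.30.2 most RSS was RssFile (page cache) and reclaimable
```
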
fpetkovski commented 1 year ago

This PR could have fixed the issue: https://github.com/thanos-io/thanos/pull/6509

jpds commented 1 year ago

Upgraded a system from 0.28.0 to 0.32.0-rc.0 and this is still an issue:

(screenshot: thanos-store-api memory usage graph)

yeya24 commented 1 year ago

@jpds I believe the issue present in 0.32.0-rc.0 has since been fixed. Please try v0.32.2 and see if it works for you.