thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io

Receive's memory usage continues to grow in v0.31.0-rc.0 #6176

Open · yutian1224 opened this issue 1 year ago

yutian1224 commented 1 year ago

Thanos, Prometheus and Golang version used: Thanos: v0.31.0-rc.0

Object Storage Provider: S3

What happened: I upgraded from 0.30.2 to the new version at around 8 o'clock and noticed that memory usage kept growing until I rolled back.

[screenshot: memory usage graph]

Args:

receive
--log.level=info
--log.format=logfmt
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--remote-write.address=0.0.0.0:19291
--objstore.config=$(OBJSTORE_CONFIG)
--tsdb.path=/var/thanos/receive
--label=thanosreplica="$(NAME)"
--label=receive="true"
--tsdb.retention=1d
--receive.local-endpoint=$(NAME).$(NAMESPACE).svc.cluster.local:10901
--receive.grpc-compression=snappy
--tsdb.out-of-order.time-window=1h
--store.limits.request-samples=1000
--store.limits.request-series=10000

fpetkovski commented 1 year ago

Would you mind posting a graph of head series and samples ingested for the same time period?

The metrics are prometheus_tsdb_head_series and prometheus_tsdb_head_samples_appended_total.
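
For reference, a minimal sketch of graphing these via the Prometheus HTTP API; the $PROM_URL address and the job="thanos-receive" label are assumptions, not from this thread:

# Head series currently held in memory, per receiver:
curl -sG "$PROM_URL/api/v1/query" --data-urlencode 'query=prometheus_tsdb_head_series{job="thanos-receive"}'
# Samples ingested per second over the last 5 minutes:
curl -sG "$PROM_URL/api/v1/query" --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total{job="thanos-receive"}[5m])'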

yutian1224 commented 1 year ago

I'm afraid we didn't collect those metrics. The figure below shows the series_count_by_metricname data from /api/v1/status/tsdb, which does not show a big increase compared to 0.30.2.

[screenshot: series count by metric name]
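
As a reference, a minimal sketch of pulling those per-metric series counts directly from the status endpoint; the host is a placeholder, exposing the endpoint on the Receive HTTP port 10902 is an assumption, and the field name follows the Prometheus API response shape:

# Per-metric series counts from the TSDB status endpoint (requires jq):
curl -s http://<receive-host>:10902/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'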

fpetkovski commented 1 year ago

@PhilipGough @saswatamcode would it be possible to test the RC with your load testing framework to see if there's a memory regression with Thanos itself?

douglascamata commented 1 year ago

@yutian1224 do you know if there might have been queries being executed in your cluster that could be touching the "hot data" in Receives?

You said you rolled back, but the right edge of the chart still shows a trend upwards. How's the memory usage since you rolled back?

yutian1224 commented 1 year ago

@douglascamata Our Receive is mainly used for Grafana dashboard queries and alerting. I'm not sure whether the "hot data" here refers to the alerting part; the alerting queries usually run at a fixed interval and are continuous.

As shown in the figure below, memory usage was relatively stable both before upgrading to 0.31.0 and after rolling back.

[screenshot: memory usage graph]

philipgough commented 1 year ago

@fpetkovski I won't get a chance to do so this week due to other commitments, but I can check next week. What I can say now is that we are already running the RC in one of our production environments and we are not seeing the issues reported here.

yutian1224 commented 1 year ago

@PhilipGough I tested the instance without the --store.limits flags and the results were stable, so I suspect the problem is caused by --store.limits.

philipgough commented 1 year ago

@yutian1224 interesting, thanks for confirming. We were indeed running the RC without those limit flags.

cc @fpetkovski

fpetkovski commented 1 year ago

I can test later this week, thanks for looking into it.

matej-g commented 1 year ago

Interesting. @fpetkovski, I assume it could then be https://github.com/thanos-io/thanos/pull/6074; I just realized we only added those flags in this release.

It would be interesting to see a profile showing where all the memory is being held.

fpetkovski commented 1 year ago

I enabled these flags in our staging environment but could not reproduce the described memory issue. @yutian1224 are you able to reproduce this problem consistently?

yutian1224 commented 1 year ago

@fpetkovski Yes. Besides the first occurrence, I also compared the behavior before and after removing the limit flags, and the problem reproduced again. Could it be related to handling a large amount of data? At the moment, each Receive instance in our environment handles about 3 million series.

fpetkovski commented 1 year ago

3M series should not be that much data. Would you mind providing a heap profile when you reproduce the issue? You can get it by hitting the /debug/pprof/heap endpoint on the receiver port 10902.
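
A minimal sketch of capturing and inspecting that profile; the host is a placeholder and a local Go toolchain is assumed:

# Capture the heap profile from the Receive HTTP endpoint mentioned above:
curl -s http://<receive-host>:10902/debug/pprof/heap -o receive.heap.prof
# Summarize the top memory consumers locally:
go tool pprof -top receive.heap.prof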

yutian1224 commented 1 year ago

@fpetkovski Sure, I'll test it over the weekend.

fpetkovski commented 1 year ago

I enabled these two flags in staging and ran Receivers for about a day. I cannot reproduce the memory leak, so I think we can release 0.31.0 as it is now. Once we have the heap profile we can check whether the limits are the culprit and cut 0.31.1 if necessary.

[screenshot: staging memory usage graph]

matej-g commented 1 year ago

Sounds good to me, thanks for checking @yutian1224 and @fpetkovski 👍

yutian1224 commented 1 year ago

@fpetkovski I deployed 0.31.0 with the limit flags enabled yesterday, and the memory problem reappeared. The attached zip file contains the pprof heap profile captured when memory usage was at about 48%.

[screenshot: memory usage graph]

Attachment: receive.prof.zip

douglascamata commented 1 year ago

FYI, I took the profile and uploaded it to this web visualization tool: https://flamegraph.com/share/7c78f5a0-cfa9-11ed-9b0d-d641223b6af4.

I'm not sure what the problem is, but github.com/thanos-io/thanos/pkg/receive.newReplicationErrors caught my attention: 7.6 GB of heap there. 🤔
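
For anyone who prefers to inspect the attached profile locally rather than via the web tool, a minimal sketch; it assumes a Go toolchain and that receive.prof has been extracted from the zip:

# Serve an interactive view (including a flame graph) in the browser on port 8080:
go tool pprof -http=:8080 receive.prof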

philipgough commented 1 year ago

I wonder whether there is some contention caused by these low read limits on receivers that is affecting the ingestion path.

@yutian1224 Can you confirm whether the limits were being hit? Can you increase your previous limits by 100x and see if the problem remains?
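
For illustration, a 100x bump of the limits from the args posted earlier would look something like this (values are simply the original ones multiplied by 100; purely a sketch):

--store.limits.request-samples=100000
--store.limits.request-series=1000000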

yutian1224 commented 1 year ago

@PhilipGough I am pretty sure the limits were being hit. The right side shows the network traffic at that time: after adding the limits, the outgoing traffic decreased significantly, while the incoming traffic did not change much.

[screenshot: network traffic graph]

fpetkovski commented 1 year ago

This is really interesting. Is the yellow line outgoing traffic and why is it negative?

yutian1224 commented 1 year ago

@fpetkovski For convenience of display, we show incoming and outgoing traffic in one panel, represented as positive and negative values. 😄