thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Guidelines scaling thanos compact #1964

Closed: jaseemabid closed this issue 4 years ago

jaseemabid commented 4 years ago

At Monzo we have over 100TB of Thanos metrics and we are noticing some serious performance bottlenecks. Since downsampling must run as a singleton, we are building up a huge backlog right now: the maximum throughput we can get out of our current setup is around 16MB/s. Compaction performance is comparable, but we manage that by splitting the work across a few shards.
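For reference, a minimal sketch of one common way to shard compactors, by external labels, assuming the compactor supports the selector relabel config discussed later in this thread; the `cluster` label, regexes and file paths are placeholders, and each shard must own a disjoint, complete set of block streams:

    # Shard A: only compact/downsample blocks whose "cluster" external label matches.
    thanos compact \
      --wait \
      --data-dir=/data \
      --objstore.config-file=/etc/thanos/objstore-config.yaml \
      --selector.relabel-config-file=/etc/thanos/shard-a.yaml

    # /etc/thanos/shard-a.yaml
    - action: keep
      source_labels: [cluster]
      regex: (prod-eu-1|prod-eu-2)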

This issue is a request for guidelines for other people operating at a similar scale.

Here is a sample downsampling event.

level=info ts=2020-01-08T13:44:33.768943954Z caller=downsample.go:284 msg="downsampled block" from=01DWRR1MBJE3G61NY0PKB91ZVT to=01DY2F87NB1S6J2J61PAXXXJ89 duration=2h4m45.237070743s

This is a 2w long block with 5m resolution downsampled to 1h.

01DWRR1MBJE3G61NY0PKB91ZVT has about 244 x 512MB chunk files and a 1.5GB index. numSamples: 29904945369, numSeries: 7777499, numChunks: 73985300

01DY2F87NB1S6J2J61PAXXXJ89 has a 1.2GB index and 26 x 512MB chunk files. numSamples: 2546495789, numSeries: 7777499, numChunks: 28488448

Reducing the resolution from 5m to 60m led to roughly a 12x reduction in the number of samples, with corresponding reductions in the number of chunks and total storage. The number of series stays the same, as expected.
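As a rough sanity check, the drop in samples lines up with the change in resolution (numbers taken from the two blocks above):

    60m / 5m resolution                      = 12x fewer samples expected
    29,904,945,369 / 2,546,495,789 samples   ~ 11.7x observed
    244 x 512MB -> 26 x 512MB of chunks      ~ 122GB -> 13GB of chunk data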

0:23 was spent downloading data (11:17 -> 11:40) and 2:04 downsampling 120GB at about 16MB/s, with the CPU pegged to 1 core, ~20MB/s read from disk and about 30GB of memory usage.

It looks like the bottleneck is the single-threaded Thanos process rather than network or disk. Any tips on how to make this faster? I thought I'd start a discussion here before starting to CPU profile the compactor.
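If it does come to profiling, a quick sketch of grabbing a CPU profile, assuming pprof endpoints are exposed on the compactor's HTTP port (default 10902; host is a placeholder):

    # 60-second CPU profile from the running compactor
    go tool pprof "http://<compactor-host>:10902/debug/pprof/profile?seconds=60"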

This is one of the smaller blocks; we have seen some raw blocks approach a TB in size and some downsampling/compaction runs take 11-12 hours.


Environment:

bwplotka commented 4 years ago

Thanks for the detailed description of the problem (:

Are you sharding, in the end? If yes, how far can you get with sharding? Sharding in the same way also helps downsampling catch up quickly.

Within one "block stream", 11-12h is indeed a bit slow, although compaction to 2w blocks (and downsampling of 2w blocks) happens at most every 2w, so the compactor can usually catch up in between.

Anyway, let's discuss what we can improve here. Some initial thoughts:

  1. I believe we can add more concurrency to the downsampling process (we process each series sequentially).
  2. Adding
  3. From v0.10.0 (releasing soon) block loading memory consumption should be reduced. This means we might be able to add more concurrency within a single shard (downsample multiple block "streams" at once). It should also improve the memory usage of compaction, so we could increase concurrency there as well and compact multiple streams concurrently. As a side effect, compaction will take even longer in v0.10.0, see this. I believe this might make the latency even worse in your case.
  4. What's the cardinality of your blocks? How many series, samples?
  5. I wonder if it makes sense to split the block at some point once it gets to TB size... even uploading and loading it takes a lot of time, not to mention manual operations/debugging if needed.

cc @pracucci cc @brian-brazil (:

bandesz commented 4 years ago

@bwplotka I work with @jaseemabid. It seems to us that downsampling doesn't consider the relabel config in the latest release (https://github.com/thanos-io/thanos/blob/0833cad83db8b257a2275ec83a3d034c73659056/cmd/thanos/downsample.go#L168). I've seen in the latest commits that the new meta syncer solves this problem, so I suspect we have to wait until 0.10 is released or run from master.

Just for reference, our biggest index files are currently around 40GB (previously we even managed to hit the 64GB TSDB index limit), and I think the biggest block had ~2400 chunk files, so around 1.2TB.

stale[bot] commented 4 years ago

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

bwplotka commented 4 years ago

In review: https://github.com/thanos-io/thanos/pull/1922

aaron-trout commented 4 years ago

We are also running into similar issues, so it's good to see we are not the only ones 😆

[Screenshot, 2020-02-26: compactor resource usage]

Our metrics bucket is a lot smaller than Monzo's, but as you can see, the CPU is still the bottleneck here. One issue, I think, is a gap in the Thanos docs: there are no guidelines on what numbers are sensible for the compactor. Here are the flags we are using on the compactor:

        - --log.level=debug
        - --retention.resolution-raw=14d
        - --retention.resolution-5m=6m
        - --retention.resolution-1h=10y
        - --consistency-delay=30m
        - --objstore.config-file=/etc/thanos/objstore-config.yaml
        - --data-dir=/data
        - --wait

I did try bumping --compact.concurrency from the default of 1, but the compact process still did not seem to use more than 1 core.

Another thing which would be good to know: is there an easy way to look at the current status of the metrics in the bucket, i.e. to find out how much of a backlog of work the compactor has right now? Presumably there is at least some backlog, since the compactor is constantly doing work; as soon as it finishes with some blocks there are always more it can pick up right away.
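One option that may cover this is the bucket inspection subcommand, a sketch below (depending on the Thanos version it lives under `thanos bucket ...` or `thanos tools bucket ...`); it prints each block's time range, resolution and compaction level, which gives a rough picture of the remaining backlog:

    # List all blocks in the bucket with time ranges, resolution and compaction level.
    thanos bucket inspect \
      --objstore.config-file=/etc/thanos/objstore-config.yaml

If I remember correctly, newer releases also ship a web-based block viewer under the same bucket subcommand.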

In other news though, the 0.10 release certainly did reduce memory usage! In this staging cluster pictured above, the memory usage on compactor went down from ~4GB to <0.5GB!

bwplotka commented 4 years ago

Glad to hear!

Let's see if the improved compactor is good enough for you guys (:


stale[bot] commented 4 years ago

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.