grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Store-gateway: inefficient chunks fetching and caching #3939

Closed by pracucci 10 months ago

pracucci commented 1 year ago

I analysed the bytes touched vs fetched in some of our production Mimir clusters over the last 24h. I used the metrics cortex_bucket_store_series_data_size_touched_bytes_sum and cortex_bucket_store_series_data_size_fetched_bytes_sum, and got the following (numbers are GBs):

[Screenshot: touched vs fetched bytes per data type over the last 24h, in GB]

Ideally, we would expect Fetched to be lower than Touched, because data served from the cache (series, postings and, similarly, chunks) is counted as touched but not fetched. That's true for "series" and "postings", but not for "chunks": we touched 58TB of chunks, but fetched 504TB (9x more).

Why the Touched vs Fetched discrepancy?

First of all, we need to understand how the metric is tracked for chunks: roughly speaking, "touched" counts the bytes of chunk data we actually use, while "fetched" counts the bytes of the ranges we read from the cache or the object storage.

Why do we fetch 9x more than the actual chunk sizes? This is the effect of two different implementation details (see the sketch after this list):

  1. When looking up the index, we don't know the chunk length, so we assume the worst-case scenario of 16000 bytes per chunk (here and here).
  2. In order to reduce the number of API calls to the object storage, we use a partitioner to aggregate close (but not adjacent) byte ranges into the same GET request (code). Basically, we intentionally over-read from the object storage.
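
To make those two points concrete, here is a minimal sketch of both behaviours. The names, the placement of the 16000-byte constant, and the gap threshold are illustrative only; the real logic lives in the store-gateway and its partitioner, not in this snippet:

```go
package main

import "fmt"

// byteRange is a [start, end) byte range within a TSDB segment file.
type byteRange struct{ start, end uint64 }

// Worst-case chunk length assumed when the index doesn't tell us the real one.
const estimatedMaxChunkSize = 16000

// chunkRange returns the range to fetch for a chunk starting at offset off,
// assuming the worst case because the actual chunk length is unknown.
func chunkRange(off uint64) byteRange {
    return byteRange{start: off, end: off + estimatedMaxChunkSize}
}

// partition merges ranges whose gap is smaller than maxGap into a single GET
// request, intentionally over-reading the bytes in between.
// Input must be sorted by start offset.
func partition(ranges []byteRange, maxGap uint64) []byteRange {
    var out []byteRange
    for _, r := range ranges {
        if n := len(out); n > 0 && r.start <= out[n-1].end+maxGap {
            if r.end > out[n-1].end {
                out[n-1].end = r.end
            }
            continue
        }
        out = append(out, r)
    }
    return out
}

func main() {
    // Two chunks whose real data is ~200 bytes each, 20KB apart in the file:
    // each is assumed to be 16000 bytes, and the partitioner then merges them
    // into a single 36000-byte read.
    ranges := []byteRange{chunkRange(0), chunkRange(20_000)}
    fmt.Println(partition(ranges, 512*1024)) // [{0 36000}]
}
```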

Why does this affect the cache too?

Chunk caching is implemented as a wrapper around the object storage client. We don't cache individual chunks; we cache portions of the objects containing chunks (the "TSDB segment files"). Each cached portion is 16KB.

The current logic is as follows:

  1. Run the partitioner, extending the ranges to fetch
  2. For each partition, call the object storage GetRange():
    1. The "caching bucket" looks up from the cache all 16KB portions within the requested range
    2. The "caching bucket" delegates to the underlying object storage client to read the missing ranges from the object storage

Since the cache lookup happens after the partitioner, we're reading from Memcached even ranges we don't actually need, which are then discarded.
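
For illustration, this is roughly what the 16KB portion splitting means for a single GetRange call (hypothetical function, not the actual caching bucket code):

```go
package main

import "fmt"

// Each cached portion of a segment file ("subrange") is 16KB.
const subrangeSize = 16 * 1024

// subrangeOffsets returns the 16KB-aligned offsets that would be looked up in
// the cache for a GetRange(start, end) call. Because the range is aligned
// outwards, it can include bytes the caller never needed.
// (Illustrative sketch, not the actual Mimir caching bucket code.)
func subrangeOffsets(start, end uint64) []uint64 {
    var offsets []uint64
    for off := (start / subrangeSize) * subrangeSize; off < end; off += subrangeSize {
        offsets = append(offsets, off)
    }
    return offsets
}

func main() {
    // A 36000-byte partition produced by the partitioner maps to three
    // 16KB subranges (48KB) looked up in the cache.
    fmt.Println(subrangeOffsets(0, 36000)) // [0 16384 32768]
}
```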

How much does the partitioner over-read?

Querying the partitioner metrics, we can see how much the partitioner over-reads:

[Screenshot: partitioner requested vs expanded bytes over the last 24h]

We requested 323TB and the partitioner expanded it to 521TB. Looking at this data, the partitioner is over-reading by a factor of 1.6x, which is not that bad.

Why is the partitioner effect 1.6x, but fetched vs touched bytes 9x?

My theory is that this happens because we compute the initial chunk ranges based on the worst case of 16000 bytes per chunk (since we don't know the actual chunk length).

We know that on average a sample is 1.4 bytes. A chunk is on average 120 samples, so 120 * 1.4 = 168 bytes, which is very far from the worst-case scenario we assume.

We also know that the average scrape interval across all our customers is 30s. Assuming chunks are consecutive in the segment file and we query 24h blocks (the largest block compacted by Mimir), all chunks of a series fit in a range of (86400 / 30) * 1.4 = ~4KB, which is still 4x smaller than the minimum 16KB range we fetch from the cache.

Summing together the two issues, plus the fact that cached portions are also aligned to 16KB, the math gets quite close to a 9x over-reading inefficiency.
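
As a quick sanity check, the same back-of-the-envelope arithmetic in code form, using only the averages quoted above (1.4 bytes/sample, 30s scrape interval, 24h blocks, 16KB minimum cache fetch, 1.6x partitioner expansion):

```go
package main

import "fmt"

func main() {
    const (
        bytesPerSample   = 1.4     // average sample size
        scrapeInterval   = 30.0    // seconds, average across tenants
        blockRange       = 86400.0 // 24h block, the largest compacted by Mimir
        minCacheFetch    = 16384.0 // minimum 16KB portion read from the cache
        partitionerRatio = 1.6     // expanded vs requested bytes
    )

    // All chunks of a series in a 24h block fit in roughly this many bytes.
    seriesBytes := blockRange / scrapeInterval * bytesPerSample
    fmt.Printf("chunk data per series: ~%.0f B\n", seriesBytes) // ~4032 B

    // Reading at least one 16KB portion per series and expanding ranges with
    // the partitioner already gives a ~6.5x over-read; the extra 16KB
    // alignment on both ends of each range pushes it toward the observed 9x.
    fmt.Printf("over-read before alignment: ~%.1fx\n", minCacheFetch/seriesBytes*partitionerRatio)
}
```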

Reference queries

GBs touched:

sum by(data_type) (increase(cortex_bucket_store_series_data_size_touched_bytes_sum{namespace=~"..."}[1d]))
/ 1024^3

GBs fetched:

sum by(data_type) (increase(cortex_bucket_store_series_data_size_fetched_bytes_sum{namespace=~"..."}[1d]))
/ 1024^3

Partitioner's requested bytes total:

sum (increase(cortex_bucket_store_partitioner_requested_bytes_total{namespace=~"..."}[1d]))

Partitioner's expanded bytes total:

sum (increase(cortex_bucket_store_partitioner_expanded_bytes_total{namespace=~"..."}[1d]))

pstibrany commented 1 year ago

Chunk size frequency from a random 24h block from our monitoring cluster (30 segment files out of 249 in the block, around 95M chunks):

$ tsdb-chunks 00* | mygrep -r '$1' -m -nh 'length: (\d+) ' | promfreq -mode=exponential -factor=2 -count=15

      (-∞ .. 1] ▏ 0 (0.0 %)
       (1 .. 2] ▏ 0 (0.0 %)
       (2 .. 4] ▏ 0 (0.0 %)
       (4 .. 8] ▏ 0 (0.0 %)
      (8 .. 16] ▏ 0 (0.0 %)
     (16 .. 32] ▊ 745401 (0.8 %)
     (32 .. 64] █████████████████████████▊ 29349226 (30.8 %)
    (64 .. 128] ██████████████████████████████▏ 34301358 (35.9 %)
   (128 .. 256] ███████████▏ 12700317 (13.3 %)
   (256 .. 512] ████████████▍ 14075654 (14.7 %)
  (512 .. 1024] ███▊ 4228234 (4.4 %)
 (1024 .. 2048] ▏ 41523 (0.0 %)
 (2048 .. 4096] ▏ 32 (0.0 %)
 (4096 .. 8192] ▏ 0 (0.0 %)
(8192 .. 16384] ▏ 0 (0.0 %)
  (16384 .. +∞) ▏ 0 (0.0 %)

summary:
 count=95441745, p50=96.88731927173262, p90=416.0724723696675, p95=502.86448473371115, p99=913.4608069468243, avg=156.80059002483662, min=17, max=2134

dimitarvdimitrov commented 1 year ago

In an upcoming PR I'll be using the chunk's actual length when fetching chunks from the bucket. For a series, we can calculate the size of each chunk as the difference between the chunk refs of two consecutive chunks. This relies on the assumption that chunks in the segment files are ordered by the series to which they belong. It works for all chunks except the last chunk of a series, whose length we still have to estimate. (Technically we could look at the chunk ref of the first chunk of the next series, but that's difficult right now.)
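
A minimal sketch of that calculation, assuming all chunk refs point into the same segment file (the type and constant names are hypothetical, and the fallback value for the last chunk is just a placeholder for whatever estimation we choose):

```go
package main

import "fmt"

// chunkMeta is a simplified view of a chunk reference from the block index:
// just the byte offset of the chunk within its segment file.
// (Hypothetical type for illustration; real refs also encode the segment file.)
type chunkMeta struct {
    offset uint64
}

// Placeholder estimate for the last chunk, whose end offset the index doesn't give us.
const estimateLastChunk = 16000

// chunkLengths returns the length of each chunk of a series, using the offset
// of the next chunk as the end of the current one. Only the last chunk needs
// an estimate.
func chunkLengths(chunks []chunkMeta) []uint64 {
    lengths := make([]uint64, len(chunks))
    for i := range chunks {
        if i+1 < len(chunks) {
            lengths[i] = chunks[i+1].offset - chunks[i].offset
        } else {
            lengths[i] = estimateLastChunk
        }
    }
    return lengths
}

func main() {
    series := []chunkMeta{{offset: 0}, {offset: 180}, {offset: 352}}
    fmt.Println(chunkLengths(series)) // [180 172 16000]
}
```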

A naive estimation

For this estimation I initially tried 2x the size of the largest chunk whose size we know for certain. This didn't work for cases where we have very few chunks (e.g. with high churn we might have one or two chunks per series) or when the chunks were sparse and covered very different time ranges (30s vs 30m).

Using an estimation of 2x caused an underestimation for 0.6% of series in some requests. This increased the number of requests to the object store by 25% (since we had to refetch those series individually and couldn't group them in the same request with the help of the partitioner). Latency was also negatively affected.

[Screenshots: underestimation rate, object store request rate, and latency. zone-a was using these estimations and my future changes; zones b and c were using the `main` implementation.]

Last chunk size analysis

This prompted me to figure out what a better estimation for chunk sizes would be.

What I wanted to find out was: given the size of the max chunk of a series (excluding the last chunk), how big will the last chunk be? Using the block index, I ran an analysis similar to Peter's on the size of chunks in a few blocks. I bucketed the size of the max chunk into powers of two and calculated percentiles for the size of the last chunk (in my analysis I almost always knew the size of the last chunk).

TL;DR: there is a lot more variability in the last chunk size when the max known chunk is small (<256 B). For series where the max chunk size was big (approaching 4096 B), the variability is lower and the last chunk was always below 2500 B.

__Results__

I analysed the chunks of 63M series from 6 blocks from different tenants. I wrote [this tool](https://github.com/grafana/mimir/tree/dimitar/chunks-sizes-analysis/tools/tsdb-chunks-len) to read and parse the index from the bucket, and then wrote a small Go program to parse the output, bucket the results and calculate percentiles.

Terms:

* max chunk size: the size of the largest chunk in the timeseries, _excluding_ the last chunk
* last chunk size: the actual size of the last chunk

Explanation of the output format:

* `128 (n=0, 0%) N/A`: there were no series where the size of the max chunk was <=128.
* `256 (n=8229338, 13.0%)`: 13% (or 8229338) of series in the input had a max chunk size in (128, 256].
* Each percentile line, e.g. `p10.000 0.749035 194 B (9.0909%)`, reads as follows:
  * `p10.000`: the percentile within series where the max chunk was <=256B (p10).
  * `0.749035`: in the bottom 10% of series where the max chunk was <=256B, the last chunk was 0.749x that of the max chunk (sorted by ratio).
  * `194 B`: in the bottom 10% of series where the max chunk was <=256B, the last chunk was 194 B (sorted by last chunk).
  * `(9.0909%)`: series with a last chunk larger than 0.749035x or 194 B, as a % of all series in the input.

```
16 (n=0, 0.0%) N/A
32 (n=3888565, 6.2%)
    p10.0000   0.896552     23 B (5.54874%)
    p50.0000   1.000000     28 B (3.08263%)
    p90.0000   1.793103     48 B (0.61653%)
    p99.0000   6.406250    171 B (0.06165%)
    p99.9000  17.392857    463 B (0.00617%)
    p99.9900  26.347826    744 B (0.00062%)
    p99.9990  38.652176    923 B (0.00006%)
    p99.9999  54.000000   1272 B (0.00001%)
64 (n=9939473, 15.8%)
    p10.0000   0.650000     31 B (14.18300%)
    p50.0000   0.954545     54 B (7.87945%)
    p90.0000   1.206897     63 B (1.57589%)
    p99.0000   5.293103    253 B (0.15759%)
    p99.9000   9.615385    483 B (0.01576%)
    p99.9900  20.027779    780 B (0.00158%)
    p99.9990  31.297297   1210 B (0.00016%)
    p99.9999  37.676472   1363 B (0.00002%)
128 (n=21282638, 33.7%)
    p10.0000   0.472973     41 B (30.36898%)
    p50.0000   0.757282     61 B (16.87166%)
    p90.0000   0.985075    102 B (3.37433%)
    p99.0000   1.972973    190 B (0.33743%)
    p99.9000   6.364407    517 B (0.03374%)
    p99.9900   9.296610    942 B (0.00338%)
    p99.9990  11.655556   1094 B (0.00034%)
    p99.9999  17.142857   1213 B (0.00003%)
256 (n=14146232, 22.4%)
    p10.0000   0.308140     53 B (20.18578%)
    p50.0000   0.712121    113 B (11.21432%)
    p90.0000   0.964286    192 B (2.24287%)
    p99.0000   1.300752    255 B (0.22429%)
    p99.9000   3.682609    640 B (0.02243%)
    p99.9900   5.264516    944 B (0.00224%)
    p99.9990   7.775194   1100 B (0.00023%)
    p99.9999   8.638462   1741 B (0.00002%)
512 (n=8905849, 14.1%)
    p10.0000   0.224319     77 B (12.70809%)
    p50.0000   0.816327    273 B (7.06005%)
    p90.0000   0.993528    400 B (1.41201%)
    p99.0000   1.113839    492 B (0.14120%)
    p99.9000   1.883721    701 B (0.01412%)
    p99.9900   3.221402    959 B (0.00141%)
    p99.9990   4.165385   1443 B (0.00014%)
    p99.9999   7.254613   1975 B (0.00001%)
1024 (n=4199038, 6.7%)
    p10.0000   0.238195    162 B (5.99176%)
    p50.0000   0.880510    540 B (3.32876%)
    p90.0000   0.990826    875 B (0.66575%)
    p99.0000   1.070946   1006 B (0.06658%)
    p99.9000   1.514925   1111 B (0.00666%)
    p99.9900   2.021277   1567 B (0.00067%)
    p99.9990   2.488916   1990 B (0.00007%)
    p99.9999   3.847195   2095 B (0.00001%)
2048 (n=655889, 1.0%)
    p10.0000   0.197719    262 B (0.93591%)
    p50.0000   0.676252    901 B (0.51995%)
    p90.0000   0.974245   1236 B (0.10399%)
    p99.0000   1.077839   1790 B (0.01040%)
    p99.9000   1.489215   2046 B (0.00104%)
    p99.9900   1.874067   2212 B (0.00010%)
    p99.9990   1.970027   2416 B (0.00001%)
    p99.9999   2.089918   2437 B (0.00000%)
4096 (n=54047, 0.1%)
    p10.0000   0.218945    475 B (0.07712%)
    p50.0000   0.532983   1171 B (0.04285%)
    p90.0000   0.860074   1869 B (0.00857%)
    p99.0000   1.006455   2230 B (0.00086%)
    p99.9000   1.086239   2385 B (0.00009%)
    p99.9900   1.136015   2437 B (0.00001%)
    p99.9990   1.158329   2482 B (0.00000%)
    p99.9999   1.158329   2482 B (0.00000%)
+Inf (n=431, 0.0%)
    p10.0000   0.000000     55 B (0.00062%)
    p50.0000   0.000000    299 B (0.00034%)
    p90.0000   0.000000    876 B (0.00007%)
    p99.0000   0.000000   1318 B (0.00001%)
    p99.9000   0.128144   1783 B (0.00000%)
    p99.9900   0.128144   1783 B (0.00000%)
    p99.9990   0.128144   1783 B (0.00000%)
    p99.9999   0.128144   1783 B (0.00000%)
```
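
For reference, the post-processing is conceptually just this (a simplified sketch with toy data, not the actual program used for the analysis):

```go
package main

import (
    "fmt"
    "math"
    "sort"
)

// bucketFor returns the power-of-two bucket (upper bound) for a max chunk
// size, mirroring the 32, 64, ..., 4096 buckets in the results above.
func bucketFor(maxChunkSize int) int {
    b := 16
    for b < maxChunkSize && b < 4096 {
        b *= 2
    }
    if maxChunkSize > 4096 {
        return math.MaxInt // the "+Inf" bucket
    }
    return b
}

// percentile returns the p-th percentile (0..100) of sorted values,
// using the nearest-rank method.
func percentile(sorted []float64, p float64) float64 {
    if len(sorted) == 0 {
        return math.NaN()
    }
    idx := int(math.Ceil(p/100*float64(len(sorted)))) - 1
    if idx < 0 {
        idx = 0
    }
    return sorted[idx]
}

func main() {
    // In the real analysis this map is filled by parsing the block index:
    // for every series, bucket the max known chunk size (bucketFor) and
    // record the size of its last chunk.
    lastChunkSizesByBucket := map[int][]float64{
        256: {53, 113, 192, 255}, // toy data for illustration
    }
    for bucket, sizes := range lastChunkSizesByBucket {
        sort.Float64s(sizes)
        fmt.Printf("bucket %d: p50=%.0f B p99=%.0f B\n",
            bucket, percentile(sizes, 50), percentile(sizes, 99))
    }
}
```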

Making an estimation

How often this estimation can be below the actual size should be based on our cache hit rate, since we wouldn't use the estimation for already cached chunks.

In production at Grafana Labs we usually have a high hit rate for the chunks cache (96.3% in the last 30 days over all Mimir clusters).

If we are willing to refetch at most 1 batch in 10, then with a hit rate of 96.3% we can afford to underfetch (and refetch) the last chunk of 0.054% of series: 1 (one series) / (10 (batches) x 5000 (series per batch) x (1 - 0.963) (cache miss rate)) = 0.0005405405. So we need to overestimate 99.946% of last chunks.
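
Spelled out in code, the same budget calculation, using the figures above (5000 series per batch, 96.3% hit rate, at most 1 refetched batch per 10 batches):

```go
package main

import "fmt"

func main() {
    const (
        seriesPerBatch     = 5000.0
        cacheHitRate       = 0.963
        refetchedBatchRate = 0.1 // at most 1 refetched batch in 10
    )

    // Only cache misses use the estimation, and a single underestimated
    // series forces a refetch for its batch.
    estimatedSeriesPerBatch := seriesPerBatch * (1 - cacheHitRate)
    allowedUnderestimateRate := refetchedBatchRate / estimatedSeriesPerBatch

    fmt.Printf("allowed underestimates: %.7f (%.3f%% of estimated series)\n",
        allowedUnderestimateRate, allowedUnderestimateRate*100) // 0.0005405 (0.054%)
    fmt.Printf("required overestimate coverage: %.3f%%\n",
        100*(1-allowedUnderestimateRate)) // 99.946%
}
```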

The following estimations would satisfy this for each max chunk size bucket:

32 (n=3888565, 6.2%)
    p99.9000  17.392857    463 B (0.00617%)
64 (n=9939473, 15.8%)
    p99.9000  9.615385    483 B (0.01576%)
128 (n=21282638, 33.7%)
    p99.9900  9.296610    942 B (0.00338%)
256 (n=14146232, 22.4%)
    p99.9900  5.264516    944 B (0.00224%)
512 (n=8905849, 14.1%)
    p99.9000  1.883721    701 B (0.01412%)
1024 (n=4199038, 6.7%)
    p99.9000  1.514925   1111 B (0.00666%)
2048 (n=655889, 1.0%)
    p99.9000  1.489215   2046 B (0.00104%)
4096 (n=54047, 0.1%)
    p99.9999  1.158329   2482 B (0.00000%)

These estimations would cover 99.95063% of series.

Ratio vs chunk size

I'm not sure whether to do an estimation with a fixed size for each bucket or use a multiplier of the actual max chunk size on a per-series basis. The worst case is worse when using a ratio, but I'm not sure about the average case. I'm tempted to go with the fixed size since it looks simpler.
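
If we went with the fixed-size option, the estimation could boil down to a lookup keyed by the max known chunk size bucket, with values taken from the percentiles above (a sketch for illustration, not the actual implementation):

```go
package main

import "fmt"

// lastChunkEstimate maps the max-known-chunk-size bucket (upper bound in
// bytes) to a fixed estimate for the last chunk, using the percentile values
// listed above.
var lastChunkEstimate = []struct {
    maxKnownChunkUpTo int
    estimate          int
}{
    {32, 463},
    {64, 483},
    {128, 942},
    {256, 944},
    {512, 701},
    {1024, 1111},
    {2048, 2046},
    {4096, 2482},
}

// estimateLastChunkSize returns the estimated size of the last chunk of a
// series, given the largest chunk whose size we know for certain.
func estimateLastChunkSize(maxKnownChunk int) int {
    for _, e := range lastChunkEstimate {
        if maxKnownChunk <= e.maxKnownChunkUpTo {
            return e.estimate
        }
    }
    // Fall back to the old worst case for unusually large max chunks.
    return 16000
}

func main() {
    fmt.Println(estimateLastChunkSize(100)) // 942
    fmt.Println(estimateLastChunkSize(700)) // 1111
}
```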

dimitarvdimitrov commented 1 year ago

Here's a long-overdue update: we've been running the fine-grained chunks changes at Grafana Labs for some time in February and March, and more recently trialing them in a few clusters.

Unfortunately, we didn't see significant improvements in most Mimir clusters. In most cases it led to a single-digit percentage decrease in in-use heap, at the cost of 20-50% more object store operations and 10-20% higher latency. There was one cluster where the decrease in heap was ~25%, but that wasn't enough to justify keeping the feature.

So we decided to start removing the feature.