thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Compression #284

Closed. mattbostock closed this 6 years ago

mattbostock commented 6 years ago

Some very quick benchmarks on production data using Facebook's zstd compression suggest that disk space used by chunks can be reduced by a factor of 2.5x and index files by at least 3x.

When storing petabytes of TSDB data (using 3x replication or erasure coding) that may then be backed up, these kinds of savings are worth pursuing. Even if the object store or the filesystem it uses supports compression, the network bandwidth saved is also significant.

Ideally, depending on the overhead, all compression would live in the TSDB library itself (e.g. https://github.com/prometheus/tsdb/issues/249), though Thanos' use case is distinct enough that we might consider compressing blocks on upload to and download from the object store. Opening this ticket to track potential options.
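
To make the idea concrete, here is a minimal sketch of compress-on-upload; this is not an existing Thanos code path, just an assumption of how it could look. It uses the github.com/klauspost/compress/zstd Go library, and the uploadFn callback is a hypothetical stand-in for the shipper's object-store upload.

package blockcompress

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd"
)

// compressAndUpload streams the file at path through a zstd encoder and hands
// the compressed stream to uploadFn (hypothetical stand-in for a bucket upload).
func compressAndUpload(path string, uploadFn func(r io.Reader) error) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	pr, pw := io.Pipe()
	enc, err := zstd.NewWriter(pw, zstd.WithEncoderLevel(zstd.SpeedDefault))
	if err != nil {
		return err
	}

	// Compress in the background while uploadFn consumes the pipe reader.
	go func() {
		_, cpErr := io.Copy(enc, f)
		if closeErr := enc.Close(); cpErr == nil {
			cpErr = closeErr
		}
		pw.CloseWithError(cpErr)
	}()

	return uploadFn(pr)
}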

bwplotka commented 6 years ago

Nice findings! Can we estimate how much latency it introduces on top of download time?

If the overhead is not large, it might be a good idea. We don't really use the chunks or index straight out of the object store, and even for ad-hoc repair or verification tools it does not matter if we add an extra (de)compression step.

Interested in @fabxc's opinion.

bwplotka commented 6 years ago

The question is: do we need this? Petabytes of TSDB data is... a lot (: If I am counting correctly, even a terabyte is out of reach for most companies, even with over 1 year of data and including the redundancy of downsampled data and HA scrapers.

fabxc commented 6 years ago

Thanks for investigating this! I noticed a while ago that, by its nature, our chunk compression leaves surprisingly many repetitive patterns. But I didn't expect another layer of compression to be this effective.

The problem is, of course, that once we compress entire files we can no longer access them randomly, which is what Thanos is built on and how it achieves good query latencies even against object storage.

So for pure archival, this would all be fine. But I don't see a straightforward way to do both. One could break files up into smaller chunks (not the same as our normal chunks) that are then compressed individually and always fetched whenever a requested range falls into them. But we'd need to keep a mapping of which actual ranges map to which compressed chunks (roughly the bookkeeping sketched below).
For the index file specifically it would get extra hairy.

While the theoretical improvement is great, I believe overall we are doing extremely well in terms of data size compared to existing alternatives.
So I doubt it's worth pursuing something this complex at this point, if it is even possible without deal-breaker tradeoffs.


On doing it in TSDB directly: for chunks it may be feasible; for the index it gets much more complex. You'd definitely have to group a few hundred chunks into compressed units to get any benefit at all, and the total compression ratio would certainly not reach the numbers from compressing the full file. Read I/O would definitely go up, and query latency would increase due to more data being read and the additional computation.
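
For illustration only, here is a minimal Go sketch of the bookkeeping such a scheme would need; the frame type and framesFor function are hypothetical names, not existing TSDB or Thanos structures.

package framemap

import "sort"

// frame describes one independently compressed unit: the uncompressed byte
// range of the original file it covers, and where it lives in the compressed object.
type frame struct {
	uncompressedOff int64
	uncompressedLen int64
	compressedOff   int64
	compressedLen   int64
}

// framesFor returns the frames that must be fetched and decompressed to serve
// a read of [off, off+length) against the original, uncompressed file.
// Frames must be sorted by uncompressedOff.
func framesFor(frames []frame, off, length int64) []frame {
	end := off + length
	i := sort.Search(len(frames), func(i int) bool {
		return frames[i].uncompressedOff+frames[i].uncompressedLen > off
	})
	var out []frame
	for ; i < len(frames) && frames[i].uncompressedOff < end; i++ {
		out = append(out, frames[i])
	}
	return out
}

Every read then costs at least one full frame of decompression, which is where the extra read I/O and latency mentioned above would come from.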

fabxc commented 6 years ago

I think another main problem with pursuing something like this is that we have no clear query performance profile of TSDB's current state. We know it's generally doing well, but we lack data on what ultimately impacts query latency. That's owed on the one hand to the fully deferred evaluation and on the other to mmap, which also causes actual reads to be deferred.

So if we added something like the above, we'd have a hard time judging how it shifts performance around.

mattbostock commented 6 years ago

I took the first chunk (000001) and the index file from a compacted block (spanning 18 hours) written by Prometheus 2.1.2.

I'm not suggesting zstd is the best option here, it's just the first thing I tried.

Block meta:

{
    "ulid": "01C9QVCDKFW2070XRCMAMTYA9T",
    "minTime": 1522216800000,
    "maxTime": 1522281600000,
    "stats": {
        "numSamples": 3698395705,
        "numSeries": 3958306,
        "numChunks": 31660562
    },
    "compaction": {
        "level": 3,
        "sources": [
            "01C9NXBPM8NDJZZDHZY4EXXHP5",
            "01C9P47DW4QEXHX3RXAKT8GD9D",
            "01C9PB35489N3W251ZHA7TN0QH",
            "01C9PHYWC4EBH5HAFRM7EHEMQ7",
            "01C9PRTKM5KT51535RYHQVPZG1",
            "01C9PZPAW5J0R7KNDV0RWRZV9W",
            "01C9Q6J247MFAHRV6RTQ1SFGA1",
            "01C9QDDSC592YWVJYY8PJND9BD",
            "01C9QM9GM69VDBDXJJXX6X7JSA"
        ]
    },
    "version": 1
}

Chunk

$ ls -alh 000001
-rw-r--r--  1 matt  staff   512M 12 Apr 11:46 000001
# Test ran on my laptop (Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz), while doing other things
# Compression ratio is in parentheses, followed by compression rate, then decompression rate
$ zstd -b1 -e22 000001 -o /dev/null
Benchmarking levels from 1 to 22
 1#000001            : 536864556 -> 275802287 (1.947), 249.4 MB/s , 601.9 MB/s
 2#000001            : 536864556 -> 244681820 (2.194), 174.0 MB/s , 563.4 MB/s
 3#000001            : 536864556 -> 230086353 (2.333), 127.3 MB/s , 581.1 MB/s
 4#000001            : 536864556 -> 225878032 (2.377), 113.8 MB/s , 573.9 MB/s
 5#000001            : 536864556 -> 224305136 (2.393),  56.6 MB/s , 541.6 MB/s
 6#000001            : 536864556 -> 217657382 (2.467),  31.5 MB/s , 570.8 MB/s
 7#000001            : 536864556 -> 213948842 (2.509),  21.8 MB/s , 538.7 MB/s
 8#000001            : 536864556 -> 211854560 (2.534),  19.2 MB/s , 534.8 MB/s
 9#000001            : 536864556 -> 210459603 (2.551),  15.2 MB/s , 568.1 MB/s
10#000001            : 536864556 -> 208035011 (2.581),  11.6 MB/s , 566.7 MB/s
11#000001            : 536864556 -> 204515614 (2.625),  9.46 MB/s , 474.7 MB/s
12#000001            : 536864556 -> 202896105 (2.646),  7.23 MB/s , 551.1 MB/s
13#000001            : 536864556 -> 199168307 (2.696),  5.57 MB/s , 508.0 MB/s
14#000001            : 536864556 -> 197814090 (2.714),  4.80 MB/s , 529.0 MB/s
15#000001            : 536864556 -> 194967240 (2.754),  3.38 MB/s , 489.1 MB/s
16#000001            : 536864556 -> 190168655 (2.823),  3.85 MB/s , 601.0 MB/s
17#000001            : 536864556 -> 186347635 (2.881),  3.20 MB/s , 580.0 MB/s
18#000001            : 536864556 -> 183627299 (2.924),  2.71 MB/s , 589.3 MB/s
19#000001            : 536864556 -> 181573064 (2.957),  2.37 MB/s , 550.7 MB/s
20#000001            : 536864556 -> 174809487 (3.071),  1.74 MB/s , 426.4 MB/s
21#000001            : 536864556 -> 173167400 (3.100),  1.57 MB/s , 416.0 MB/s
22#000001            : 536864556 -> 171953306 (3.122),  1.44 MB/s , 413.0 MB/s

Index

$ ls -alh index
-rw-r--r--  1 matt  staff   1.5G 12 Apr 11:06 index
$ zstd -b1 -e22 index -o /dev/null
Benchmarking levels from 1 to 22
 1#index             :1579860816 -> 580201924 (2.723), 296.5 MB/s , 657.5 MB/s
 2#index             :1579860816 -> 578631639 (2.730), 292.6 MB/s , 615.9 MB/s
 3#index             :1579860816 -> 562291573 (2.810), 211.4 MB/s , 619.5 MB/s
 4#index             :1579860816 -> 557336759 (2.835), 145.9 MB/s , 504.5 MB/s
 5#index             :1579860816 -> 552620476 (2.859), 103.4 MB/s , 510.4 MB/s
 6#index             :1579860816 -> 529761565 (2.982),  67.8 MB/s , 677.3 MB/s
 7#index             :1579860816 -> 527049052 (2.998),  47.3 MB/s , 514.3 MB/s
 8#index             :1579860816 -> 518433934 (3.047),  41.2 MB/s , 734.3 MB/s
 9#index             :1579860816 -> 517052328 (3.056),  36.8 MB/s , 682.3 MB/s
10#index             :1579860816 -> 514818517 (3.069),  31.2 MB/s , 764.1 MB/s
11#index             :1579860816 -> 502642881 (3.143),  25.9 MB/s , 748.6 MB/s
12#index             :1579860816 -> 502087438 (3.147),  14.5 MB/s , 720.8 MB/s
13#index             :1579860816 -> 501217631 (3.152),  11.3 MB/s , 740.6 MB/s
14#index             :1579860816 -> 500456004 (3.157),  7.63 MB/s , 569.1 MB/s
15#index             :1579860816 -> 498248964 (3.171),  5.82 MB/s , 644.5 MB/s
16#index             :1579860816 -> 478442609 (3.302),  4.17 MB/s , 723.8 MB/s
^C-index             :1579860816 ->
# Interrupted

mattbostock commented 6 years ago

Thanks @fabxc.

Good point about range requests on chunk data. But what about the index file? In my example above it's 1.5GiB, so reducing that by 3x would be beneficial.

I believe overall we are doing extremely well in terms of data size compared to existing alternatives.

Completely agree.

fabxc commented 6 years ago

Yea, the index file is unfortunately just even more difficult. I'm actually surprised that it compresses even better, since the bulk of it is series definitions and the associated chunk pointers. Those are already delta-compressed, so entropy should be relatively high.

mattbostock commented 6 years ago

Sorry @fabxc, I should have clarified in my last comment: I'm not clear on why the index file is more difficult to compress.

For example, I'm thinking the Thanos shipper could compress the index file on upload, and the store instance could decompress it on download. As far as I know, the index file is not subject to range requests against the object storage.

I suspect the index file compresses so well due to the label names and values. As a side note, I'm seeing up to 5.6x compression ratios on this same index file using xz, though that's an extreme example and may be too slow to be practical (I'm thinking more about backups where xz is concerned).
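
A rough sketch of the store-side counterpart to that idea (hedged: downloadFn is a hypothetical stand-in for a bucket get, not a Thanos API, and this again assumes the github.com/klauspost/compress/zstd library).

package indexfetch

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd"
)

// downloadAndDecompress fetches a compressed index via downloadFn and writes
// the decompressed bytes to dst, ready to be read like a regular index file.
func downloadAndDecompress(dst string, downloadFn func() (io.ReadCloser, error)) error {
	rc, err := downloadFn()
	if err != nil {
		return err
	}
	defer rc.Close()

	dec, err := zstd.NewReader(rc)
	if err != nil {
		return err
	}
	defer dec.Close()

	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, dec)
	return err
}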

mattbostock commented 6 years ago

My mistake, Thanos does perform range queries on index files.

One way round this would be to pre-fetch the index files in full, but that would make restarting the store nodes an expensive operation (if the files were not persisted locally) and would delay queries until the indices are fully loaded.

Should we close this? It seems any further file size gains would be best implemented in the TSDB library directly (and better tracked in that repository).

fabxc commented 6 years ago

I suspect the index file compresses so well due to the label names and values.

Strings are actually fully deduplicated already, i.e. only one copy of each exists. They may of course have partial overlaps (e.g. instance names), but in total they only account for a few dozen kilobytes of the file.

Elsewhere we of course reference those strings, so the references are quite repetitive. There's an upstream issue about giving smaller reference values to more frequent strings.

One way round this would be to pre-fetch the index files in full, but that would make restarting the store nodes an expensive operation (if the files were not persisted locally) and would delay queries until the indices are fully loaded.

Yes, that was actually the initial design, because range queries against index files intuitively seemed too complex. It turned out to be pretty okay and fast.

Storing the whole index file indeed makes restarts (with a clean disk) very slow, but it would also mean you'd need a lot of store nodes, or really huge disks for them, pretty quickly. It feels much more preferable to have store nodes be bounded by the query load they experience rather than by the amount of data they provide access to.

Should we close this? It seems any further file sizes gains would be best implemented in the TSDB library directly (and better tracked in that repository).

SGTM. Adding more custom complexity to Thanos will probably make it hard to stay compatible in the long run.

mattbostock commented 6 years ago

It feels much more preferable to have store nodes be bounded by the query load they experience rather than by the amount of data they provide access to.

Good point.

Thanks both! Closing.