timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

Compression improvements using lossy techniques #2529

Open lrolsen opened 3 years ago

lrolsen commented 3 years ago

Hi,

Have you considered implementing any of the improvements that VictoriaMetrics have added to standard Gorilla compression?

They are outlined [here](https://medium.com/faun/victoriametrics-achieving-better-compression-for-time-series-data-than-gorilla-317bc1f95932).

As a simple test I tried converting float8 to int8 in a table containing 224 million rows of financial data. From this:

```sql
CREATE TABLE public.trade_double (
    instrumentvenueid int4      NOT NULL,
    externaltradeid   int4      NOT NULL,
    executedatutc     timestamp NOT NULL,
    price             float8    NOT NULL,
    qty               float8    NOT NULL
);
```

To this:

```sql
CREATE TABLE public.trade_int (
    instrumentvenueid int4      NOT NULL,
    externaltradeid   int4      NOT NULL,
    executedatutc     timestamp NOT NULL,
    price             int8      NOT NULL,
    qty               int8      NOT NULL
);
```

With the following compression settings:

```sql
ALTER TABLE trade_double SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'instrumentvenueid',
    timescaledb.compress_orderby   = 'executedatutc ASC, externaltradeid'
);
```

The conversion was done by multiplying the price and qty columns by 1000000.
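Roughly, the conversion was along these lines (a sketch; it assumes the scaled values fit in int8):

```sql
-- Convert floating-point prices/quantities to fixed-point int8 values
-- with six implied decimal places (scale factor 1,000,000).
INSERT INTO trade_int (instrumentvenueid, externaltradeid, executedatutc, price, qty)
SELECT instrumentvenueid,
       externaltradeid,
       executedatutc,
       (price * 1000000)::int8,
       (qty * 1000000)::int8
FROM trade_double;
```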

Uncompressed, the table was 24 GB. Compressed with doubles, it went down to 3347 MB, whereas with ints it went down to 2047 MB. So a 39% reduction with that simple change.
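For completeness, a sketch of how all existing chunks can be compressed to produce sizes like the above (this assumes both tables were created as hypertables first):

```sql
-- Compress every existing chunk of each hypertable so that the
-- reported table sizes reflect the compressed representation.
SELECT compress_chunk(c) FROM show_chunks('trade_double') c;
SELECT compress_chunk(c) FROM show_chunks('trade_int') c;
```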

mfreed commented 3 years ago

Hi @lrolsen thanks for the request.

We consider it very important that our compression algorithms, as built, are lossless and correct by default, so they don't currently gain higher compressibility by using lossy techniques.

VictoriaMetrics' approach, by contrast, is lossy, even though their write-ups and blog posts don't overtly acknowledge this. When you insert and later query data, the results can differ from the values originally written, because VictoriaMetrics can throw away precision to achieve higher compressibility. They would argue that this is "good enough" for the Prometheus use case, but TimescaleDB is used in a broader variety of settings than VictoriaMetrics, and many of our users really want correctness and predictability.

This correctness issue with VictoriaMetrics' compression is detailed in a write-up by Prometheus core contributors, as one example: https://www.robustperception.io/evaluating-performance-and-correctness

In the future we might consider additional, lossy algorithms (and we certainly welcome contributions); it just hasn't been a priority yet.

Today, users are also very welcome to drop precision upstream for greater compressibility. For example, even if we continue to store floats internally, writing values with fewer significant digits means less entropy and thus higher compression. Alternatively, you can convert your data to integer form yourself, as you did above.
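As a rough illustration of the first option (the staging_trades table and the choice of four decimal places here are purely hypothetical):

```sql
-- Illustrative sketch only: round incoming values to a fixed number of
-- decimal places before insert; fewer significant digits means less
-- entropy, so the (lossless) float compression compresses better.
INSERT INTO trade_double (instrumentvenueid, externaltradeid, executedatutc, price, qty)
SELECT instrumentvenueid,
       externaltradeid,
       executedatutc,
       round(price::numeric, 4)::float8,
       round(qty::numeric, 4)::float8
FROM staging_trades;  -- hypothetical raw/staging table with float8 columns
```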