timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

Feature: JSONB-specific compression techniques #2978

Open · obrienmd opened this issue 3 years ago

obrienmd commented 3 years ago

Timescale's use of specific compression techniques for different data types makes for incredible compression ratios (and performance) with most of the columns I use it for, and for that I am truly thankful :)

Has the Timescale team considered specific compression techniques to target mostly-repeating JSONB columns? Without delving deeply into the code, it seems that JSONB is likely compressed per-record by LZ4, given the compression ratio I've observed with a sample column of large JSON objects in which only a single value changes between two adjacent records (time-wise). Regardless of the number of similar records in a chunk, the ratio seems relatively static.

I suspect that something like ZSON (looks to be not actively maintained), or a per-chunk dictionary for each JSONB column, could provide massive gains in compression for this data type.
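For reference, a minimal sketch of how one can check the ratio TimescaleDB's native compression actually achieves on a JSONB-heavy hypertable (assuming TimescaleDB 2.x; the hypertable name metrics and the device_id segment-by column here are hypothetical):

    -- enable native compression on the hypertable
    ALTER TABLE metrics SET (
      timescaledb.compress,
      timescaledb.compress_segmentby = 'device_id',
      timescaledb.compress_orderby = 'time DESC'
    );

    -- compress all existing chunks (skipping any that are already compressed)
    SELECT compress_chunk(c, true) FROM show_chunks('metrics') c;

    -- compare on-disk size before and after compression
    SELECT pg_size_pretty(before_compression_total_bytes) AS before,
           pg_size_pretty(after_compression_total_bytes)  AS after
    FROM hypertable_compression_stats('metrics');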

erimatnor commented 3 years ago

@obrienmd currently we don't optimize compression specifically for JSON.

obrienmd commented 3 years ago

Thanks - data suggested that was the case :) Is it something you think might be interesting for the future?

mattfysh commented 3 years ago

Hi @erimatnor - are there future plans to allow this? In my case, performance is not a bottleneck, so I'm happy if this were to come with a toggle. For now, here's what I'm thinking of doing as a workaround (using the following object as an example):

{
  "city": "San Francisco",
  "state": "CA",
  "temperature": 86.07
}
  1. The table is created with a user_data jsonb column

    CREATE TABLE x (
      time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
      key text NOT NULL,
      user_data jsonb NOT NULL,
      PRIMARY KEY(time, key)
    );
    SELECT create_hypertable('x', 'time');
  2. Before insert, inspect the user_data object, extract any numeric attributes, and ensure the table schema has a column to store each extracted attribute

    ALTER TABLE x ADD COLUMN IF NOT EXISTS user_temperature DOUBLE PRECISION NULL;
  3. Insert the row

    -- time and key are required by the primary key; the values here are placeholders
    INSERT INTO x (time, key, user_data, user_temperature)
    VALUES (now(), 'example-key', '{"city": "San Francisco", "state": "CA"}', 86.07);
  4. When reading from the table, look for any user-defined columns and merge those values back into the user_data object (see the sketch below)

Any issues you can see with this approach? I know the ideal scenario is to declare a temperature column upon table creation, but I'm building a generic solution and won't know which numeric columns the user would like to store until they create their spec.
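A minimal sketch of the read path in step 4, assuming the table from step 1 and that the extracted user_temperature column maps back to a temperature key (both names are just this example's):

    -- merge the extracted numeric column back into the JSONB document on read;
    -- note this adds "temperature": null for rows where user_temperature is NULL
    SELECT time,
           key,
           user_data || jsonb_build_object('temperature', user_temperature) AS user_data_full
    FROM x;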

vincentmele commented 2 years ago

PostgreSQL 14 added support for LZ4 compression of TOAST-able columns such as jsonb.

On an existing table: ALTER TABLE tbl ALTER COLUMN col1 SET COMPRESSION lz4; Note that this only applies to newly inserted rows; a VACUUM FULL will (allegedly) solve this and recompress existing values.

See: https://www.postgresql.fastware.com/blog/what-is-the-new-lz4-toast-compression-in-postgresql-14

I have not tested if and how this works fully with TimescaleDB.
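For anyone trying it, a minimal sketch of that sequence on stock PostgreSQL 14+ built with LZ4 support (tbl and col1 are placeholder names; pg_column_compression() reports the method used for each stored, compressed value):

    -- switch the column's TOAST compression method (PostgreSQL 14+)
    ALTER TABLE tbl ALTER COLUMN col1 SET COMPRESSION lz4;

    -- rewrite the table; per the comment above this should (allegedly) recompress existing rows
    VACUUM FULL tbl;

    -- inspect which compression method stored values actually use
    SELECT pg_column_compression(col1) AS method, count(*) FROM tbl GROUP BY 1;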

mfreed commented 1 year ago

Just to avoid any confusion: TimescaleDB has always supported compressing json/jsonb using LZ compression:

https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/

We have seen good compression rates given the often repetitive nature of JSON/JSONB data. This thread is about building an even more bespoke compression algorithm for this data to potentially achieve even higher rates.
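Until something bespoke exists, one way to approximate the per-chunk-dictionary idea at the schema level (not TimescaleDB's compression, just an application-side sketch with hypothetical table names) is to store the mostly-static JSONB shell once and keep only the changing keys per row:

    -- lookup table holding each distinct repeating JSONB shell once
    CREATE TABLE payload_dict (
      dict_id int   PRIMARY KEY,
      shell   jsonb NOT NULL
    );

    -- the hypertable stores a reference to the shell plus only the keys that change
    CREATE TABLE readings (
      time    TIMESTAMPTZ NOT NULL,
      dict_id int NOT NULL REFERENCES payload_dict (dict_id),
      delta   jsonb NOT NULL
    );
    SELECT create_hypertable('readings', 'time');

    -- reconstruct the full document on read
    SELECT r.time, d.shell || r.delta AS payload
    FROM readings r
    JOIN payload_dict d USING (dict_id);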

jedwards1211 commented 6 months ago

The current docs are not clear that LZ compression is used. They only mention "dictionary compression" and provide an example of associating a number with an entire value, which isn't illustrative of how the compression works in general.