obrienmd opened this issue 3 years ago
@obrienmd currently we don't optimize compression specifically for JSON.
Thanks - data suggested that was the case :) Is it something you think might be interesting for the future?
Hi @erimatnor - are there future plans to allow this? In my case, performance is not a bottleneck, so I'd be happy if this came with a toggle. For now, here's what I'm thinking of doing as a workaround (using the following object as an example):
{
"city": "San Francisco",
"state": "CA",
"temperature": 86.07,
}
The table is created with a user_data jsonb column:
CREATE TABLE x (
    time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    key text NOT NULL,
    user_data jsonb NOT NULL,
    PRIMARY KEY(time, key)
);
SELECT create_hypertable('x', 'time');
Before insert, inspect the user_data object, extract any numeric attributes, and ensure the table schema can store each extracted attribute:
ALTER TABLE x ADD COLUMN IF NOT EXISTS user_temperature DOUBLE PRECISION NULL;
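The inspection step could also happen in SQL rather than in application code. A minimal, purely illustrative sketch that lists the numeric attributes of a sample object using jsonb_each and jsonb_typeof:
SELECT key AS attribute, value
FROM jsonb_each('{"city": "San Francisco", "state": "CA", "temperature": 86.07}'::jsonb)
WHERE jsonb_typeof(value) = 'number';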
Insert the row:
INSERT INTO x (time, key, user_data, user_temperature) VALUES (now(), 'some-key', '{"city": "San Francisco", "state": "CA"}', 86.07);
When reading from the table, look for any user-defined columns and merge those values back into the user_data object.
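A minimal sketch of that read-side merge, assuming only the single user_temperature column added above (a generic solution would build this dynamically):
SELECT time, key,
       CASE WHEN user_temperature IS NOT NULL
            THEN user_data || jsonb_build_object('temperature', user_temperature)
            ELSE user_data
       END AS user_data
FROM x;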
Any issues you can see with this approach? I know the ideal scenario is to declare the temperature column upon table creation, but I'm building a generic solution and won't know which numeric columns the user would like to store until they create their spec.
PostgreSQL 14 added the ability to use LZ4 compression for TOASTable columns like jsonb.
On an existing table: ALTER TABLE tbl ALTER COLUMN col1 SET COMPRESSION lz4;
Note that this only compresses newly inserted rows; a VACUUM FULL will (allegedly) rewrite and recompress the existing ones.
See: https://www.postgresql.fastware.com/blog/what-is-the-new-lz4-toast-compression-in-postgresql-14
I have not tested if and how this works fully with TimescaleDB.
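If anyone wants to verify it, a quick sketch (untested here) for checking which TOAST compression method stored values actually ended up with, assuming PostgreSQL 14+ and the x table from above:
SELECT pg_column_compression(user_data) AS compression_method, count(*)
FROM x
GROUP BY 1;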
Just to avoid any confusion: TimescaleDB has always supported compressing json/jsonb using LZ compression:
https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/
We have seen good compression rates given the often repetitive nature of JSON/JSONB data. This thread is about building an even more bespoke compression algorithm for this data to potentially achieve even higher rates.
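For completeness, a minimal sketch of enabling that native compression on the example hypertable from earlier in this thread (TimescaleDB 2.x syntax; the segmentby/orderby choices and the 7-day policy are just illustrative):
ALTER TABLE x SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'key',
    timescaledb.compress_orderby = 'time DESC'
);
SELECT add_compression_policy('x', INTERVAL '7 days');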
The current docs are not clear that LZ compression is used. They only mention "dictionary compression" and provide an example of associating a number with an entire value, which isn't illustrative of how the compression works in general.
Timescale's use of specific compression techniques for different data types makes for incredible compression ratios (and performance) with most of the columns I use it for, and for that I am truly thankful :)
Has the Timescale team considered specific compression techniques to target mostly-repeating JSONB columns? Without delving deeply into the code, it seems that JSONB is likely compressed per-record by LZ4, given the compression ratio I've observed with a sample column of large JSON objects in which only a single value changes between two adjacent records (time-wise). Regardless of the number of similar records in a chunk, the ratio seems relatively static.
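One way to observe such per-hypertable ratios is via the built-in stats functions; a sketch, assuming TimescaleDB 2.x and native compression enabled on a hypothetical hypertable x:
SELECT before_compression_total_bytes,
       after_compression_total_bytes
FROM hypertable_compression_stats('x');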
I suspect that something like ZSON (which appears to be no longer actively maintained), or a per-chunk dictionary for each JSONB column, could provide massive gains in compression for this data type.