knizhnik / imcs

In-Memory Columnar Store extension for PostgreSQL
Apache License 2.0

feature request: support lz4 compression #29

Open amutu opened 10 years ago

amutu commented 10 years ago

I see RLE in imcs, which is great, but I think its benefit on real-world data is limited compared with general-purpose compression algorithms such as lz4, lzo or snappy. Infobright can reach 10:1~40:1 compression ratios; maybe imcs can do this too. I would like to see imcs support lz4 compression, as it could greatly reduce memory usage and disk IO. Is it possible? I would appreciate it if this feature could be added. If you are too busy to implement it, maybe I can contribute some code (I have 4 years of Java coding experience and learned C and C++ at university; my idea is to compress only the leaf pages, so the overhead is small)? ref: http://code.google.com/p/lz4/ https://www.infobright.com/index.php/products/compression/

knizhnik commented 10 years ago

One of the main advantages of the vertical (columnar) model is better and faster compression than the traditional horizontal (row) model allows. For example, PostgreSQL uses compression for TOAST data.

But do not consider compression a magic tool which can reduce your storage demands 10 times for free. First of all, the compression ratio greatly depends on the data you are compressing: all compression algorithms exploit redundancy in the data. For truly random data, every compression algorithm (including Infobright's) will give a ratio of about 1, not ten. For text documents the usual compression ratio is about 3. The compression ratio also depends on the size of the compression window, i.e. the size of the compressed unit. In a DBMS the unit of compression is usually a page, and typical page sizes (4/8kb) are too small to reach a high compression ratio. So even if Infobright is able to shrink some file ten times, the results will most likely be less exciting if you split that file into 4kb pages and compress each page individually.

So a high compression ratio can be reached only if the input data contains a lot of duplicate values. RLE can compress only runs of consecutive duplicate values, while more sophisticated compression algorithms are able to locate arbitrary repeated bit sequences. Some compression algorithms are optimized for compressing similar (rather than identical) values: they first calculate the difference between subsequent values and then encode that difference (delta encoding).
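Just to illustrate the delta idea in SQL terms (the quotes table and its columns are purely hypothetical, and a real compressor of course works on raw bytes, not through SQL):

-- store the first value plus per-row differences instead of the full values
select ts,
       price - lag(price) over (order by ts) as delta  -- NULL for the first row
from quotes;

For a slowly changing integer series the deltas are small and highly repetitive, which is exactly what makes them easy to encode compactly.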

IMCS is first of all oriented towards financial data, i.e. stock options... If you look at such data, you will notice that even for stable symbols the price tends to vary a little bit, so runs of unchanged prices are unlikely. Also, ask, bid, open, close, high and low prices are usually stored as floating point numbers, and even very close floating point values can differ in a large number of bits, so encoding them with delta-based algorithms is problematic.

Compression has the most effect on character data (like symbol names). But there are more "radical" approaches here. First of all, you can use the symbol name as the timeseries identifier, so that you do not have to repeat it in every record. If that is not possible, or if the string can be very long, you can place it in a separate table (a dictionary) and store an integer identifier instead of the string.
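A minimal sketch of the dictionary idea in plain SQL (the table and column names here are only illustrative, not part of IMCS):

-- dictionary mapping symbol names to small integer identifiers
create table symbol_dict (
    id   serial primary key,
    name varchar unique not null
);

-- the fact data stores the integer id instead of the string
create table quote (
    symbol_id integer references symbol_dict(id),
    ask       real,
    bid       real,
    ts        timestamp
);

-- names are resolved only when presenting results
select d.name, q.ask, q.bid
from quote q
join symbol_dict d on d.id = q.symbol_id;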

One more point: compression produces records of varying length. To allocate and free space for such records we need a sophisticated memory allocator, and PostgreSQL does not provide one: ShmemAlloc does not allow memory to be released at all! IMCS maintains its own list of fixed-size pages to make it possible to reclaim space; with variable-size pages this trick will not work.

Compression also adds significant CPU overhead. Look, for example, at LZ4: its compression speed is about 400MB/sec, while modern SSDs provide an even higher write speed. It means that instead of trying to use compression to fit the data in RAM, we can just write it to disk at the same speed! And the reported compression ratio for LZ4 is 2, not 10! 16GB versus 32GB is certainly a large difference, but it is not a difference in principle. If you have 10TB of data, compression will in any case not allow you to fit it in RAM; and if you have 10GB of data, it can be more efficient to buy a server with 10GB more RAM than to try to compress those 10GB down to 5GB...

If you want, we can run experiments on real data you have and estimate the possible effect of compressing it with LZ4, RLE or any other compression algorithm...

amutu commented 10 years ago

We use imcs for ad hoc reporting; one of our tables looks like this:

log_time: timestamp --- when this record is produced, selective
uid: bigint --- user id
cgi_id: int --- cardinality is 30
clientversion: int --- cardinality is 200, has hot spots
country: varchar --- USA, China, Japan etc., cardinality is ~100, has hot spots
province: varchar --- cardinality is ~1000, has hot spots
city: varchar --- cardinality is ~10000, has hot spots
device: varchar --- iphone4, nokiaX etc., cardinality is 1000, has hot spots
net: varchar --- wifi, Vodafone 3G..., cardinality is 10, has hot spots
os: varchar --- iOS6.x, Android 2.3, cardinality is 100, has hot spots
kernel --- linux 3.0..., cardinality is 100
consuming: int --- cardinality is 30
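In plain SQL terms the table described above would look roughly like this (the table name is made up for illustration, and the type of kernel is not stated above, so varchar is assumed):

create table report_log (
    log_time      timestamp, -- when the record is produced, selective
    uid           bigint,    -- user id
    cgi_id        int,       -- cardinality ~30
    clientversion int,       -- cardinality ~200
    country       varchar,   -- cardinality ~100
    province      varchar,   -- cardinality ~1000
    city          varchar,   -- cardinality ~10000
    device        varchar,   -- cardinality ~1000
    net           varchar,   -- cardinality ~10
    os            varchar,   -- cardinality ~100
    kernel        varchar,   -- type assumed; cardinality ~100
    consuming     int        -- cardinality ~30
);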

We append billions of rows like this to imcs every day, and we hope to keep the latest 14 or even 30 days of data in memory so that it can be accessed quickly. The queries mostly look like:

select count(1) from t where log_time::date = 'yesterday'::date and consuming > 5 group by country, device order by 1 desc limit 10;

I think that except for log_time, and maybe uid, all other columns would benefit from compression.

LZ4 compression speed is about 400MB/s, but decompression speed is about 1.8GB/s. For a mostly write-once, read-many OLAP system, write speed is not as critical as read speed.

As I recall, some file systems, maybe ZFS or btrfs, can judge at run time whether data is compressible. For example, if you put jpg/rmvb/zip files on a filesystem with the compression flag set, it can skip compressing those already-compressed files: it tries to compress the first part (roughly 10%) of a data block and, if the compression ratio is not good, it does not compress the whole block. That is sophisticated, though; we could begin with a simple all-or-nothing compression switch.

knizhnik commented 10 years ago

Certainly, for such a schema compression of country, province, city, device, net and os can be very useful. But IMHO it would be more efficient to replace them with integers, i.e. to have separate tables (dictionaries) with a string->id mapping. This may increase insertion time, but that can be addressed by keeping an in-memory copy of the dictionaries (std::map in C++, HashMap in Java, ...). In that case insert speed may even increase (smaller record size). You can use these integer identifiers for grouping/aggregation instead of the original values: your query, for example, does not have to be changed and will work correctly even if country and device names are replaced with integers. And if you need to output, for example, the name of the country, you perform a join with the dictionary table. As far as I understand, your queries mostly perform aggregation, so their result sets will not be very large and the join will not take a noticeable amount of time.
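For example, a sketch of how the grouping query above could work with dictionaries (the country_id/device_id columns and dictionary tables are illustrative, and this is plain SQL, not the IMCS API):

create table country_dict (id smallint primary key, name varchar);
create table device_dict  (id smallint primary key, name varchar);

-- aggregate on the small integer ids, then join back to the dictionaries
-- only to print the names in the (small) result set
select cd.name as country, dd.name as device, s.counter
from (select country_id, device_id, count(1) as counter
      from t
      where log_time::date = 'yesterday'::date and consuming > 5
      group by country_id, device_id
      order by counter desc
      limit 10) s
join country_dict cd on cd.id = s.country_id
join device_dict  dd on dd.id = s.device_id
order by s.counter desc;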

Certainly I can implement compression of these fields at the page level, but it will be significantly less efficient. Even if there are 1000 different devices in total, it can happen that among the ~100 device values on a single page there are 50 unique ones; in that case the compression ratio will be less than 2.

With the dictionary approach the ratio is fixed: the average size of a name divided by the size of an identifier. What is the maximal length of, for example, a country name? 8 bytes, 16 bytes, ...? An identifier can be encoded in just two bytes. There is also a peculiarity (and limitation) of IMCS: it requires all columns to have a fixed size. So we would have either to truncate "long" country names or to waste a lot of space.

I expect that the distribution of this data is far from uniform, so among 1000 devices, 90% of records will belong to the 10 most popular ones. In this case RLE compression can be efficient, though it only takes consecutive values into account: for a sequence of values like {"iPhone 4.0", "iPhone 5.0", "iPhone 4.0", "iPhone 5.0", ...} RLE can do nothing. But any alternative compression scheme significantly complicates lookup of an element by position - we would have to decompress the page first in order to locate the element at the specified position.

But I will think more about this feature.

knizhnik commented 10 years ago

Now IMCS is able to load varying-size strings of unlimited length. They are converted into integer identifiers using a dictionary. The dictionary size should be larger than the total capacity of all such columns. If the dictionary size is less than 64kb, then identifiers are stored as 16-bit integers, so instead of the original string value you store just two bytes.

Below is an example of usage:

create table CrashLog(
    log_time timestamp, -- when this data is produced, selective
    uid bigint,         -- user id
    country varchar,    -- USA, China, Japan etc., cardinality is ~100, has hot spots
    city varchar,       -- cardinality is ~10000, has hot spots
    device varchar,     -- iphone4, nokiaX etc., cardinality is 1000, has hot spots
    net varchar,        -- wifi, Vodafone 3G..., cardinality is 10, has hot spots
    os varchar          -- iOS6.x, Android 2.3, cardinality is 100, has hot spots
);

insert into CrashLog values (timestamp('12-Apr-2014 11:54'), 10000001, 'USA', 'New York', 'iPhone4', 'wifi', 'iOS6.x');
insert into CrashLog values (timestamp('12-Apr-2014 11:55'), 10000002, 'Japan', 'Tokio', 'Sony Xperia Z1', 'Vodafone 3G', 'Android 4.4');
insert into CrashLog values (timestamp('12-Apr-2014 11:56'), 10000003, 'China', 'Beijing', 'iPhone5', 'wifi', 'iOS7.x');

select cs_create('CrashLog', 'log_time');

 log_time | uid | country | city | device | net | os
----------+-----+---------+------+--------+-----+----
 timestamp:{2014-04-12 11:33:14.414614,2014-04-12 11:33:14.417347,2014-04-12 11:33:14.419531} | int8:{10000001,10000002,10000003} | varchar:{USA,Japan,China} | varchar:{New York,Tokio,Beijing} | varchar:{iPhone4,Sony Xperia Z1,iPhone5} | varchar:{wifi,Vodafone 3G,wifi} | varchar:{iOS6.x,Android 4.4,iOS7.x}
(1 row)

select cs_translate(group_by, 1) as country,
       cs_translate(group_by, 2) as device,
       agg_val as counter
from (select (cs_project_agg(cs_hash_count(country||device))).* from CrashLog_get()) s;

 country |     device     | counter
---------+----------------+---------
 Japan   | Sony Xperia Z1 |       1
 USA     | iPhone4        |       1
 China   | iPhone5        |       1
(3 rows)