Open fulmicoton opened 7 years ago
I have been looking around src/postings/compression/mod.rs to find a place where I can start on this ticket.
If I understand it correctly, this is a specific case of compressing sorted values.
BlockEncoder::compress_block_sorted calls out to bitpacker.compress_sorted, which suggests that the solution to this ticket belongs in the bitpacking crate.
My approach would be to compute the deltas; if they are all zero, return the single value. That value would take 1 byte instead of 16 bytes of block-compressed values plus 1 byte for the start_value.
The closer to the SIMD code path we do it, the faster it should be.
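The "deltas are all the same" check described above could be sketched as follows. This is a minimal illustration with hypothetical helper names, not tantivy's or the bitpacking crate's actual API:

```rust
// Hypothetical sketch of the constant-delta detection discussed above.
// `compute_deltas` and `uniform_delta` are illustrative names only.

/// Compute successive differences of a sorted block of doc ids.
fn compute_deltas(vals: &[u32]) -> Vec<u32> {
    vals.windows(2).map(|w| w[1] - w[0]).collect()
}

/// Returns Some(delta) when every delta in the block is identical,
/// in which case the whole block could be encoded as that one value.
fn uniform_delta(deltas: &[u32]) -> Option<u32> {
    match deltas.split_first() {
        Some((first, rest)) if rest.iter().all(|d| d == first) => Some(*first),
        _ => None,
    }
}

fn main() {
    // The dense block 0, 1, 2, ..., 127 has a uniform delta of 1.
    let block: Vec<u32> = (0..128).collect();
    assert_eq!(uniform_delta(&compute_deltas(&block)), Some(1));
}
```

The decoder would then need a flag (e.g. a reserved bit width) to recognize this case, which is the offset concern raised below.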
I added a test case in a new branch.
However, changing the compression also means we need to handle it at the point of decompression and not mess up our offsets.
Am I on the right track?
Using 7 low bits of 1 byte only encodes values up to 127.
As I understand it, the smallest amount of memory will have to be 2 bytes for 128u8..255u8.
I think I might be mixing up VInt and block encoding using SSE3.
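The 7-bits-per-byte scheme mentioned above is the classic VInt (variable-length integer) encoding, which explains the 1-byte limit of 127 and the jump to 2 bytes for 128u8..255u8. A minimal sketch of one common convention (high bit set on all but the last byte; this is illustrative, not necessarily tantivy's exact layout):

```rust
// Minimal VInt encoder sketch: 7 payload bits per byte, with the high
// bit marking continuation. Values 0..=127 fit in 1 byte; 128..=16383
// need 2 bytes. Illustrative only, not tantivy's actual implementation.
fn vint_encode(mut v: u32, out: &mut Vec<u8>) {
    while v >= 128 {
        // Emit the low 7 bits with the continuation bit set.
        out.push((v as u8 & 0x7f) | 0x80);
        v >>= 7;
    }
    // Last byte: high bit clear.
    out.push(v as u8);
}

fn main() {
    let mut buf = Vec::new();
    vint_encode(127, &mut buf);
    assert_eq!(buf.len(), 1); // fits in one byte
    buf.clear();
    vint_encode(255, &mut buf);
    assert_eq!(buf.len(), 2); // 128..=255 already needs two bytes
}
```

Block encoding (SSE3 bitpacking) instead packs 128 deltas at a fixed bit width, which is the scheme the rest of this thread discusses.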
@petr-tik I think there was a big misunderstanding about the ticket's meaning. I tried to edit the title and the issue description to make it clearer.
This is very low priority... This is mostly important for analytics. There used to be one stakeholder for which this would have been a nice-to-have. I think we can let this ticket sleep for a while.
Great. Thanks for clarifying the description and low priority. I tagged my branch against it, might come back to it later.
The backstory: we had an analytics engine in which a lot of terms were associated with all (or almost all) documents in the index. This was very common for terms representing an A/B test group for a feature that was rolled out to 100% of traffic.
In that case, the first block, for instance, would go

0, 1, 2, 3, ..., 127

The deltas, as they are computed today, would go

1, 1, 1, 1, ..., 1

(1 is repeated 128 times). The deltas are bitpacked over 1 bit each, so that overall the block uses 1 byte (to encode the bitpacking width, here 1) plus 128 * 1 / 8 = 16 bytes. Overall, 17 bytes are used to encode this block.

In reality, since posting lists are strictly increasing, we could replace delta by delta - 1 and reduce the size of our dense range considerably. In that case, we would have to encode

0, 0, 0, 0, ..., 0

Technically, encoding delta - 1 can save us space for any block. We expect the probability for a block to take 16 bytes less than its original size to be around 1 / 2^b, where b is the original bit width. So considering the block initially took 1 + 16b bytes, our expected relative gain in terms of compressed size for a block of width b is 16 / ((1 + 16b) * 2^b).
It does not really matter when the block width is large anyway...
So we are left with two possible approaches: if it is not much of a performance regression, encode delta - 1 instead of the deltas; the other possibility is to hardcode the case where all of the deltas are 1s.

This ticket is low priority at the moment.
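The two options could be contrasted as follows. This is an illustrative sketch with invented names, not a proposed patch:

```rust
// (a) Always encode delta - 1. This is valid because posting lists are
//     strictly increasing, so every delta is >= 1.
fn deltas_minus_one(sorted: &[u32]) -> Vec<u32> {
    sorted.windows(2).map(|w| w[1] - w[0] - 1).collect()
}

// (b) Keep plain deltas but special-case a block whose deltas are all 1,
//     which is the dense-range situation described in this thread.
fn all_ones(sorted: &[u32]) -> bool {
    sorted.windows(2).all(|w| w[1] - w[0] == 1)
}

fn main() {
    let dense: Vec<u32> = (0..128).collect();
    assert!(all_ones(&dense));
    // Under option (a), the dense block becomes all zeros (bit width 0).
    assert!(deltas_minus_one(&dense).iter().all(|&d| d == 0));
    assert!(!all_ones(&[0, 2, 3]));
}
```

Option (a) shrinks every block a little; option (b) only catches the fully dense case but leaves the hot decompression path untouched.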