quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
12.02k stars 670 forks

Optimize codec for the case where all deltas in the block are 1. #167

Open fulmicoton opened 7 years ago

fulmicoton commented 7 years ago

The backstory: we had an analytics engine in which a lot of terms were associated with all (or almost all) documents in the index. This was very common for terms representing an A/B test group for a feature that was rolled out to 100% of traffic.

In that case, the first block, for instance, would go 0, 1, 2, 3, ..., 127.

The deltas, as they are computed today, would go 1, 1, 1, ..., 1 (1 repeated 128 times). The deltas are bitpacked over 1 bit each, so the block uses 1 byte (to encode the bitpacking width, here 1) plus 128 × 1 / 8 = 16 bytes of payload. Overall, 17 bytes are used to encode this block.
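For concreteness, here is the size arithmetic as a tiny Rust helper (an illustrative model of a 128-doc bitpacked block with a 1-byte width header, not tantivy's exact on-disk layout):

```rust
/// Bytes used by one bitpacked block of 128 deltas at bit width `b`:
/// 1 header byte for the width, plus 128 * b bits of payload.
/// (Illustrative model only, not tantivy's exact layout.)
fn block_size_bytes(b: usize) -> usize {
    1 + 128 * b / 8
}

fn main() {
    // The all-ones block above: width 1 => 1 + 16 = 17 bytes.
    assert_eq!(block_size_bytes(1), 17);
}
```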

In reality, since posting lists are strictly increasing, we could encode delta - 1 instead of delta and reduce the size of such dense ranges considerably.

In that case, we would have to encode 0, 0, 0, ..., 0. Technically, encoding delta - 1 can save us space for any block. We expect the probability for a block to take 16 bytes less than its original size to be around 1 / 2^b, where b is the original bit width.

So, considering that the block initially took 1 + 16b bytes, our expected relative gain in compressed size for a block of bit width b is:

16 / ((1 + 16b) · 2^b)
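Plugging in a few values of b makes the point concrete (a quick sketch; `expected_relative_gain` is just the formula above, not tantivy code):

```rust
/// Expected relative gain from encoding `delta - 1`, under the rough model
/// above: a block of bit width `b` costs 1 + 16*b bytes and shrinks by
/// 16 bytes with probability ~ 1 / 2^b.
fn expected_relative_gain(b: u32) -> f64 {
    16.0 / ((1.0 + 16.0 * b as f64) * 2f64.powi(b as i32))
}

fn main() {
    // b = 1: ~0.471, b = 2: ~0.121, b = 3: ~0.041, b = 4: ~0.015
    for b in 1..=4 {
        println!("b = {}: {:.3}", b, expected_relative_gain(b));
    }
}
```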

It does not really matter when the block width is large anyway...

So we are left with two possible approaches:

- If it does not cause too much of a performance regression, encode delta - 1 instead of the deltas (a sketch follows this list).
- Alternatively, hardcode the case where all of the deltas are 1.
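Here is a minimal sketch of the first approach, assuming strictly increasing doc ids; `deltas_minus_one` is an illustrative helper, not tantivy's actual encoder:

```rust
/// Since doc ids in a posting list are strictly increasing, every delta is
/// >= 1, so we can bitpack `delta - 1` and add the 1 back when decoding.
fn deltas_minus_one(doc_ids: &[u32], start: u32) -> Vec<u32> {
    let mut prev = start;
    doc_ids
        .iter()
        .map(|&doc| {
            let d = doc - prev - 1; // safe: strictly increasing => delta >= 1
            prev = doc;
            d
        })
        .collect()
}

fn main() {
    // A fully dense run: deltas are all 1, so delta - 1 is all 0 and the
    // block bitpacks at width 0.
    let docs: Vec<u32> = (1..=128).collect();
    assert!(deltas_minus_one(&docs, 0).iter().all(|&d| d == 0));
}
```

One wrinkle: the very first block needs a start value strictly below the first doc id (doc id 0 would make `delta - 1` underflow with `start = 0`), so a real encoder would have to special-case the start of the list.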

This ticket is low priority at the moment.

petr-tik commented 5 years ago

I have been looking around src/postings/compression/mod.rs to find a place where I can start on this ticket.

If I understand it correctly, this is a specific case of compressing sorted values.

BlockEncoder::compress_block_sorted calls out to bitpacker.compress_sorted, which suggests that the solution to this ticket belongs to the bitpacking crate.

My approach would be to compute the deltas and, if they are all zero (i.e. all the original deltas are 1), return a single marker value. That value would take 1 byte instead of the 16 bytes of block-compressed values plus 1 byte for start_value.

The closer to SIMD we do it, the faster it should be.
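A rough sketch of this special case (the marker value and names here are hypothetical, not tantivy's actual format):

```rust
/// Hypothetical special case: if every delta in the block is 1 (i.e. the
/// doc ids are consecutive), emit a single marker byte instead of the
/// bitpacked payload. The decoder must recognize the marker and regenerate
/// the run, so the offsets stay consistent on both sides.
const ALL_CONSECUTIVE: u8 = 0xFF; // hypothetical marker value

fn compress_block_sketch(deltas: &[u32]) -> Vec<u8> {
    if deltas.iter().all(|&d| d == 1) {
        vec![ALL_CONSECUTIVE] // 1 byte for the whole block
    } else {
        unimplemented!("fall back to the regular bitpacked encoding")
    }
}

fn main() {
    let dense = vec![1u32; 128];
    assert_eq!(compress_block_sketch(&dense), vec![ALL_CONSECUTIVE]);
}
```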

I added a test case in a new branch

However, changing the compression also means we need to detect this case at the point of decompression and not mess up our offsets.

Am I on the right lines?

petr-tik commented 5 years ago

Using 7 low bits of 1 byte only encodes values up to 127.

As I understand it, the smallest amount of memory will have to be 2 bytes for 128u8..255u8.

petr-tik commented 5 years ago

> Using 7 low bits of 1 byte only encodes values up to 127.
>
> As I understand it, the smallest amount of memory will have to be 2 bytes for 128u8..255u8.

I think I might be mixing up VInt and block encoding using SSE3.
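For reference, the 7-low-bits observation is how variable-length integers work; a minimal LEB128-style VInt sketch (illustrative, not tantivy's exact implementation):

```rust
/// Each byte carries 7 payload bits; the high bit means "more bytes follow".
/// So 0..=127 fits in one byte, while 128..=255 needs two.
fn vint_encode(mut v: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7F) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

fn main() {
    let mut buf = Vec::new();
    vint_encode(127, &mut buf);
    assert_eq!(buf.len(), 1);
    buf.clear();
    vint_encode(128, &mut buf);
    assert_eq!(buf.len(), 2);
}
```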

fulmicoton commented 5 years ago

@petr-tik I think there was a big misunderstanding about the ticket's meaning. I tried to edit the title and the issue description to make it clearer.

This is very low priority... It is mostly important for analytics. There used to be one stakeholder for whom this would have been a nice-to-have. I think we can let this ticket sleep for a while.

petr-tik commented 5 years ago

Great, thanks for clarifying the description and the low priority. I tagged my branch against it and might come back to it later.