Right now the posting lists are compressed using delta encoding + bitpacking.
In reality the documents are strictly monotonous, so the delta is always >0.
We could bitpack
(doc_n+1 - doc_n - 1) instead of (doc_n+1 - doc_n)
The benefit is usually tiny. We save 128 when this change made it possible to change the bitwidth used for the block.
For a block with bitwidth b, assuming a uniform distirbution (a suspicious but useful hypothesis), we have roughly speaking 1 / 2^(b - 1) to gain 128 bits.
(8 + 128(b-1))/(8 + 128b)
So we have 1/ 2^(b-1) chances to improve our compression rate by a factor of 16 / (1 + 16*b).
For a bitwidth of 1, a full saturation that's 100% chance to bring our block of 136bits down to 8 bits.
For a bitwidth of 2, (roughly > 90% saturation) this gives us a chance to bring our block of 264bits down to 136 bits.
Above it gets rapidly not as interesting.
This should not hurt performance too much provided we do this when we have the info in a bitpacking register.
Currently we have two unused bits in the bitwidths that we could use to ensure backward compatibility.
bit 7 could for instance encode whether we want this scheme or not.
The implementation is trickier than it may seem unfortunately.
(In particular what should be the initial value we use)
Right now the posting lists are compressed using delta encoding + bitpacking.
In reality the documents are strictly monotonous, so the delta is always >0. We could bitpack
(doc_n+1 - doc_n - 1)
instead of(doc_n+1 - doc_n)
The benefit is usually tiny. We save 128 when this change made it possible to change the bitwidth used for the block. For a block with bitwidth b, assuming a uniform distirbution (a suspicious but useful hypothesis), we have roughly speaking
1 / 2^(b - 1)
to gain 128 bits.(8 + 128(b-1))/(8 + 128b)
So we have
1/ 2^(b-1)
chances to improve our compression rate by a factor of16 / (1 + 16*b)
.This should not hurt performance too much provided we do this when we have the info in a bitpacking register.
Currently we have two unused bits in the bitwidths that we could use to ensure backward compatibility. bit 7 could for instance encode whether we want this scheme or not.
The implementation is trickier than it may seem unfortunately. (In particular what should be the initial value we use)