quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Possible to save 16 bytes on linear block metadata in the multilinear interpolation #1212

Open fmassot opened 2 years ago

fmassot commented 2 years ago

Currently, we serialize these values for each linear block:

Together, these represent 29 bytes. But it's possible to remove 16 of them.
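For reference, here is a hypothetical layout consistent with the 29-byte figure (the field names are illustrative, not copied from the actual code):

```rust
// Hypothetical per-block metadata adding up to 29 bytes.
struct LinearBlockMeta {
    data_start_offset: u64, // 8 bytes: where the block's bitpacked data starts
    value_start_pos: u64,   // 8 bytes: index of the block's first value
    positive_offset: u64,   // 8 bytes: b'
    slope: f32,             // 4 bytes: a
    num_bits: u8,           // 1 byte: bit width used for bitpacking
}
```

The two `u64`s that can be deduced or are supplied by the caller (as argued below) would account for the 16 removable bytes.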

Let's first try to understand what we really need, without bitpacking. We want to model some y values `[y1, y2, ..., yn]` with a linear function `y = a*x + b`.

But we want all the values we write to be positive, so we pick `b'` to satisfy this requirement, and in the end we have `y = a*x + b'`.

We know that our y values will not match the linear function exactly, so we write the difference between the real values and the expected ones to disk: `y_written = y_real - (a*x + b')` (1)

To rebuild `y_real`, we only need to know `y_written`, `b'`, and `a`. The x value is given by the caller, so we don't need to write it.

But we also want to bitpack the `y_written` values. For that, we need to store one additional piece of information: the `num_bits` used for bitpacking. And that's all we need: `b'`, `a`, and `num_bits`.
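A minimal sketch of the whole round trip under these definitions (illustrative names and a naive slope fit; values are assumed to fit in `i64`):

```rust
/// The only per-block metadata that still needs to be serialized.
struct BlockMeta {
    offset: i64, // b'
    slope: f32,  // a
    num_bits: u8,
}

/// Deterministic evaluation of a*x, truncated to an integer, so that the
/// encoder and decoder agree exactly.
fn linear(slope: f32, x: usize) -> i64 {
    (slope * x as f32) as i64
}

fn encode(ys: &[u64]) -> (BlockMeta, Vec<u64>) {
    let last = ys.len() - 1;
    // Naive slope estimate from the first and last value of the block.
    let slope = if last == 0 {
        0.0
    } else {
        (ys[last] as f32 - ys[0] as f32) / last as f32
    };
    // b' is the smallest value such that y - (a*x + b') >= 0 for every x,
    // i.e. the minimum of y - a*x over the block.
    let offset = ys
        .iter()
        .enumerate()
        .map(|(x, &y)| y as i64 - linear(slope, x))
        .min()
        .expect("non-empty block");
    // The residuals are non-negative by construction, so they bitpack cleanly.
    let residuals: Vec<u64> = ys
        .iter()
        .enumerate()
        .map(|(x, &y)| (y as i64 - linear(slope, x) - offset) as u64)
        .collect();
    let max = residuals.iter().copied().max().unwrap_or(0);
    let num_bits = (64 - max.leading_zeros()) as u8;
    (BlockMeta { offset, slope, num_bits }, residuals)
}

fn decode(meta: &BlockMeta, residuals: &[u64], x: usize) -> u64 {
    // y_real = y_written + (a*x + b'): b', a, and the residual suffice.
    (residuals[x] as i64 + linear(meta.slope, x) + meta.offset) as u64
}
```

Under the assumed field sizes, the serialized per-block metadata shrinks to 8 + 4 + 1 = 13 bytes, i.e. exactly 16 bytes fewer than the current 29.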

Is it worth saving 16 bytes?

Saving 16 bytes is not going to change the world, nor is it a priority. But removing variables that we don't need is a good thing, because it makes the code more readable.

What can we do?

For backward compatibility, I think we can add a new codec in a future release that will be more optimized than the current one.

PSeitz commented 2 years ago

`data_start_offset` can be deduced by summing `num_bits * block_size` for each block before the block where we want to read. We can do it when we open the fast field file.
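A minimal sketch of that deduction (illustrative, not tantivy's actual API), assuming 512 values per block:

```rust
const BLOCK_SIZE: u64 = 512; // assumed number of values per block

/// Reconstruct each block's data_start_offset from the per-block bit
/// widths, e.g. once when the fast field file is opened.
fn data_start_offsets(num_bits_per_block: &[u8]) -> Vec<u64> {
    let mut offsets = Vec::with_capacity(num_bits_per_block.len());
    let mut offset: u64 = 0;
    for &num_bits in num_bits_per_block {
        offsets.push(offset);
        // Each block stores BLOCK_SIZE values of num_bits bits each; with
        // BLOCK_SIZE = 512 that is always a whole number of bytes.
        offset += num_bits as u64 * BLOCK_SIZE / 8;
    }
    offsets
}
```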

> Saving 16 bytes is not going to change the world, nor is it a priority. But removing variables that we don't need is a good thing, because it makes the code more readable.

We would still have the struct with `data_start_offset`, since we shouldn't recalculate it on every access. So the removal of `data_start_offset` would only affect the serialized data, not the code.

fmassot commented 2 years ago

> We would still have the struct with `data_start_offset`, since we shouldn't recalculate it on every access. So the removal of `data_start_offset` would only affect the serialized data, not the code.

True. But I think we can also save some lines of code ^^.

fulmicoton commented 2 years ago

Is the slope actually useful in the blockwise linear scheme, or is frame-of-reference sufficient for blocks of 512 elements? An alternative could be, for instance, a simpler codec that takes 9 bytes per block, with smaller blocks (say 128 elements) and no linear interpolation.
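For reference, a sketch of what such a frame-of-reference codec could look like (assumed layout: an 8-byte block minimum plus a 1-byte bit width, i.e. 9 bytes of metadata per block):

```rust
/// Frame-of-reference encoding for one block: store the block minimum and
/// bitpack each value as a delta from it. No slope, no interpolation.
fn encode_frame_of_reference(block: &[u64]) -> (u64, u8, Vec<u64>) {
    let min = *block.iter().min().expect("non-empty block");
    let deltas: Vec<u64> = block.iter().map(|&v| v - min).collect();
    let max_delta = deltas.iter().copied().max().unwrap_or(0);
    // Bit width needed to bitpack the largest delta.
    let num_bits = (64 - max_delta.leading_zeros()) as u8;
    (min, num_bits, deltas) // metadata: min (8 bytes) + num_bits (1 byte)
}
```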

fmassot commented 2 years ago

@fulmicoton Linear interpolation works great on sorted data: on HTTP logs, the compression ratio is 0.26 for bitpacking and 0.05 for multilinear. I tested your idea with blocks of 128 elements and got a 0.066 ratio, which is not bad at all. Comparisons are available here, but I have not yet added your idea. My intuition is that:

One more thing: I tested the compression algorithms only on integers. On floating-point values, we may use techniques other than computing the delta; the Gorilla paper talks about XOR:

> We discovered that the value in most time series does not change significantly when compared to its neighboring data points. Further, many data sources only store integers. This allowed us to tune the expensive prediction scheme in [25] to a simpler implementation that merely compares the current value to the previous value. If values are close together the sign, exponent, and first few bits of the mantissa will be identical. We leverage this to compute a simple XOR of the current and previous values rather than employing a delta encoding scheme.

Blog post source
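To illustrate the XOR trick (a sketch only; the full Gorilla encoding also run-length encodes the leading and trailing zeros):

```rust
/// XOR each f64's bit pattern with its predecessor's. For slowly-changing
/// series the result is mostly zero bits, so only the short run of
/// meaningful bits between the leading and trailing zeros must be stored.
fn xor_deltas(values: &[f64]) -> Vec<u64> {
    let mut prev = 0u64;
    values
        .iter()
        .map(|&v| {
            let bits = v.to_bits();
            let xored = bits ^ prev;
            prev = bits;
            xored
        })
        .collect()
}
```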

fmassot commented 2 years ago

I added a frame-of-reference implementation in #1217 to compare the results.