fmassot opened this issue 2 years ago
> `data_start_offset` can be deduced by summing `num_bits * block_size` for each block before the block where we want to read. We can do it when we open the fast field file.
>
> Saving 16 bytes is not changing the world, nor is it a priority. But removing variables that we don't need is a good thing because it makes the code more readable.
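For illustration, a minimal sketch of that deduction, done once when the file is opened. The names (`block_start_offsets`, `num_bits_per_block`, `block_size`) are made up for this sketch and are not tantivy's actual API; blocks are assumed to be byte-aligned:

```rust
// Hypothetical sketch: rebuild every block's data start offset once, at
// open time, instead of serializing one offset per block.
fn block_start_offsets(num_bits_per_block: &[u8], block_size: u64) -> Vec<u64> {
    let mut offsets = Vec::with_capacity(num_bits_per_block.len());
    let mut offset = 0u64;
    for &num_bits in num_bits_per_block {
        offsets.push(offset);
        // Each block holds `block_size` values bitpacked to `num_bits` bits.
        offset += num_bits as u64 * block_size / 8;
    }
    offsets
}
```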
We would still have the struct with `data_start_offset`, since we shouldn't calculate it on every access. So the removal of `data_start_offset` would be only in the serialized data, not in the code.
True. But I think we can also save some lines of code ^^.
Is the slope actually useful in the blockwise linear thing, or is frame of reference sufficient for blocks of 512 elements? An alternative could be, for instance, to have simpler code that takes 9 bytes per block, smaller blocks (say 128 elements), and no linear interpolation.
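If I follow the idea, it would look something like this sketch (hypothetical names, not an actual tantivy codec): each block stores only a `u64` reference value and a `u8` bit width, i.e. 9 bytes of metadata, and the values are bitpacked as offsets from the reference:

```rust
// Frame-of-reference block header: 8 + 1 = 9 bytes, no slope.
struct ForBlockHeader {
    reference: u64, // block minimum; values are stored as `value - reference`
    num_bits: u8,   // bit width needed for the largest offset in the block
}

fn for_block_header(block: &[u64]) -> ForBlockHeader {
    let reference = block.iter().copied().min().unwrap_or(0);
    let max_offset = block.iter().map(|&v| v - reference).max().unwrap_or(0);
    // Bits needed to bitpack the largest offset (0 if the block is constant).
    let num_bits = (64 - max_offset.leading_zeros()) as u8;
    ForBlockHeader { reference, num_bits }
}
```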
@fulmicoton linear interpolation works great on sorted data: on HTTP logs, the compression ratio is 0.26 for bitpacking and 0.05 for multilinear. I tested your idea with blocks of 128 elements and got a ratio of 0.066, which is not bad at all. You have comparisons available here, but I did not yet add your idea. My intuition is that:
One more thing: I tested the compression algorithms only on integers. On floating point, we may use techniques other than computing the delta; the Gorilla paper talks about XOR:
> We discovered that the value in most time series does not change significantly when compared to its neighboring data points. Further, many data sources only store integers. This allowed us to tune the expensive prediction scheme in [25] to a simpler implementation that merely compares the current value to the previous value. If values are close together the sign, exponent, and first few bits of the mantissa will be identical. We leverage this to compute a simple XOR of the current and previous values rather than employing a delta encoding scheme.
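For what it's worth, here is a tiny sketch of that XOR observation on raw `f64` bit patterns (just the XOR step, not the full Gorilla encoding of leading/trailing-zero counts; `xor_deltas` is a made-up name):

```rust
// XOR each value's raw bits with the previous value's bits. For slowly
// changing series, the sign, exponent and high mantissa bits cancel out,
// so the result has long runs of leading zeros that a codec can exploit.
fn xor_deltas(values: &[f64]) -> Vec<u64> {
    let mut prev_bits = 0u64; // Gorilla stores the first value verbatim.
    values
        .iter()
        .map(|&value| {
            let bits = value.to_bits();
            let xored = bits ^ prev_bits;
            prev_bits = bits;
            xored
        })
        .collect()
}
```

For example, `xor_deltas(&[15.5, 15.5, 15.625])` yields `0` for the repeated value and a word with many leading zeros for the third; a real codec would then encode the leading-zero count and only the meaningful bits.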
I added the frame-of-reference implementation in #1217 to compare the results.
Currently, we serialize these values for each linear block:
- `data_start_offset` (u64)
- `value_start_pos` (u64)
- `positive_val_offset` (u64)
- `slope` (f32)
- `num_bits` (u8)

It represents 29 bytes. But it's possible to remove 16 bytes.
Let's try to understand what we really need first, without bitpacking. We want to model some y values `[y1, y2, ..., yn]` as a linear function:

`y = a.x + b`

But we want all values to be positive, so we find `b'` to satisfy this requirement, and in the end we have:

`y = a.x + b'`

We know that our y values will not match the linear function, but we will write the difference between the expected values and the real ones on the disk:

`y_written = y_real - (a.x + b')` (1)

To rebuild `y_real`, we only need to know `y_written`, `b'` and `a`. The `x` value is given by the caller, so we don't need to write it.

But we also want to bitpack the `y_written` values. For that we need to store additional information: the `num_bits` used for bitpacking. And that's all:

- `data_start_offset` can be deduced by summing `num_bits * block_size` for each block before the block where we want to read. We can do it when we open the fast field file.
- `value_start_pos` is not needed, as we can see in (1).
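As a sketch, the per-block metadata would then be 13 bytes instead of 29. The names (`LinearBlockMeta`, `residual`, `reconstruct`) are made up for this illustration and the arithmetic is simplified; this is not tantivy's actual serializer:

```rust
// Hypothetical per-block metadata after removing the two deducible fields:
// 8 + 4 + 1 = 13 bytes instead of 29.
struct LinearBlockMeta {
    positive_val_offset: u64, // b' in the formula above
    slope: f32,               // a in the formula above
    num_bits: u8,             // bit width used to bitpack the residuals
    // data_start_offset: deduced at open time by summing
    //                    num_bits * block_size over all previous blocks.
    // value_start_pos:   implied by (1), nothing to store.
}

// y_written = y_real - (a.x + b'), per equation (1). `b'` is assumed to
// be chosen so that the subtraction never underflows.
fn residual(y_real: u64, x: u64, meta: &LinearBlockMeta) -> u64 {
    y_real - ((meta.slope * x as f32) as u64 + meta.positive_val_offset)
}

// The reader inverts (1); `x` is supplied by the caller, so it is never
// serialized.
fn reconstruct(y_written: u64, x: u64, meta: &LinearBlockMeta) -> u64 {
    y_written + ((meta.slope * x as f32) as u64 + meta.positive_val_offset)
}
```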
**Is it worth saving 16 bytes?**

Saving 16 bytes is not changing the world, nor is it a priority. But removing variables that we don't need is a good thing because it makes the code more readable.
**What can we do?**
For backward compatibility, I think we can add a new codec in a future release that will be more optimized than the current one.