spiraldb / vortex

A toolkit for working with compressed Arrow in-memory, on-disk, and over-the-wire
Apache License 2.0
92 stars 5 forks source link

FOR encoding a sorted array doesn't preserve sortedness #400

Closed robert3005 closed 1 week ago

robert3005 commented 2 weeks ago

FOR encoding has an edge case where if you have values close to both ends of the type range you might not be able to represent encoded values in the same dtype since they will exceed max value. This can happen in particular when you have values close to MIN and MAX of the dtype. In order to solve this issue we currently perform wrapping subtraction so values too big end up being shifted to the end of the range.

Example for i5 dtype

[-10, -8, 10] -> subtract min (-10) -> [0, 2, -12]

Now the converted array is no longer sorted.

FOR encoding is done so we can shift all the signed integers into unsigned space which works with wrapping subtraction, however, it's unclear that it's the most practical choice. The cases where you'd run into the issue will not compress well further since you have values with very few leading zeros (1?) and we seem to lose some guarantees that are desired for our compression codecs. I suggest that we refuse to FOR and then Bitpack such arrays and let compressor choose a different encoding.

robert3005 commented 2 weeks ago

I think if we were to change the dtypes of FOR array from signed to unsigned then everything still works