spiraldb / vortex

An extensible, state-of-the-art columnar file format
https://vortex.dev
Apache License 2.0
888 stars 21 forks source link

[bug] FastLanes signed integer array round trip failed #929

Closed XiangpengHao closed 4 weeks ago

XiangpengHao commented 4 weeks ago

Hi Vortex! I recently played a bit on Vortex <-> Arrow with FastLane encoding.

However, it seems that current implementation automatically convert signed interger to unsigned variant, as shown in the following code: https://github.com/spiraldb/vortex/blob/develop/encodings/fastlanes/src/bitpacking/compress.rs#L33-L40

Here's a minimal reproducible code, it will panic because the Int64Array is converted back to UInt64Array

use arrow::{
    array::{AsArray, Int64Array, UInt64Array},
    datatypes::{Int64Type, UInt64Type},
};
use vortex::{arrow::FromArrowArray, IntoCanonical};
use vortex_fastlanes::{find_best_bit_width, BitPackedArray};

fn main() {
    let vortex_array =
        vortex::Array::from_arrow(&Int64Array::from_iter(-500..500), false).as_primitive();

    let bitpacked_array = BitPackedArray::encode(
        vortex_array.as_ref(),
        find_best_bit_width(&vortex_array).unwrap(),
    )
    .unwrap();

    println!("bitpacked_array: {:?}", bitpacked_array);

    let canonical_array = bitpacked_array
        .into_canonical()
        .unwrap()
        .into_arrow()
        .unwrap();
    let arrow_array_again = canonical_array.as_primitive::<Int64Type>();
    println!("arrow_array_again: {:?}", arrow_array_again);
}

I tried to comment out the .to_unsigned() etc, and it seems to work again. I haven't read FastLane paper carefully, can you provide some hints on why to convert it into unsigned values?

lwwmanning commented 4 weeks ago

Thanks for finding this @XiangpengHao!

The bug is actually that BitPackedArray::encode is only supposed to work on unsigned integer inputs.

When compressing signed ints in the Vortex Compressor, we effectively always do FoR (Frame of Reference, which subtracts the minimum value and converts to unsigned) or ZigZag (which essentially does a wrapping shift left, converting negative signed ints to odd unsigned ints and positive signed ints to even unsigned ints).

Fix is in #930, will get that merged and cut a patch release :)

lwwmanning commented 4 weeks ago

fix released in 0.11.0