perf: we may have a chance to avoid some copies when decoding with some alignment adjustment(page alignment, chunk alignment)

lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming.

https://lancedb.github.io/lance/

Apache License 2.0


perf: we may have a chance to avoid some copies when decoding with some alignment adjustment(page alignment, chunk alignment) #3115

Open broccoliSpicy opened 1 week ago

broccoliSpicy commented 1 week ago

PR #3101 added alignment to the page layout and chunk layout, but PR #3099 still needs to copy the raw data read from disk before it can start decoding; see the code here.

For fastlanes bitpacking, there is also a copy: https://github.com/lancedb/lance/blob/c237bcb9318d30cf382aecd56b673aae85b2c555/rust/lance-encoding/src/encodings/physical/bitpack_fastlanes.rs#L1727

In binary + miniblock, I think this copy can be avoided if we align each chunk to 4 bytes. In fastlanes bitpacking, because of the use of SIMD instructions, the alignment requirement is stronger (reference here), and we may also need to change the compression logic to allow it.
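To illustrate why alignment decides whether the copy is needed, here is a minimal sketch of viewing a byte buffer as `u32` values. It borrows when the buffer happens to start on a 4-byte boundary and copies otherwise. The function name `view_as_u32` is hypothetical and is not part of lance; the sketch also assumes the values are stored in native byte order.

```rust
use std::borrow::Cow;
use std::mem::align_of;

/// Hypothetical helper (not a lance API): interpret `bytes` as native-endian
/// `u32` values. Trailing bytes that do not fill a whole `u32` are ignored.
pub fn view_as_u32(bytes: &[u8]) -> Cow<'_, [u32]> {
    let n = bytes.len() / 4;
    if bytes.as_ptr() as usize % align_of::<u32>() == 0 {
        // SAFETY: the pointer is non-null and aligned for u32, the first
        // `n * 4` bytes are in bounds, and every bit pattern is a valid u32.
        let words = unsafe { std::slice::from_raw_parts(bytes.as_ptr() as *const u32, n) };
        Cow::Borrowed(words)
    } else {
        // Misaligned start: fall back to copying the values out.
        let words = bytes
            .chunks_exact(4)
            .map(|c| u32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Cow::Owned(words)
    }
}
```

If every chunk starts on a suitably aligned offset inside an aligned page buffer, a decoder built this way would always take the borrowed path.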

westonpace commented 1 week ago

> in binary + miniblock, I think this copy can be avoided if we align each chunk to 4 bytes.

We already align each chunk to 8 bytes but there was a bug preventing this from working correctly.

> in fastlanes bitpacking, because the use of SIMD instruction, the alignment requirement is stronger

How much stronger? 64 byte alignment for a 4KiB chunk seems pretty extensive (up to 6% of the block is wasted space) but I suppose it is manageable. Still, if we want to require this then I'd prefer changing the MiniBlockCompressor trait to allow compressors to state how much alignment they need. This way compressors that don't need such strict requirements don't have to pay.
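One way the "compressors state their own alignment" idea could look is sketched below. This is not the actual `MiniBlockCompressor` trait; `ChunkAlignment`, `PlainBinary`, `SimdBitpack`, `padding_for`, and `write_chunk` are all hypothetical names used for illustration. The default of 8 bytes mirrors the existing chunk alignment mentioned above, and 64 bytes is the value under discussion for SIMD bitpacking.

```rust
/// Hypothetical trait (not lance's `MiniBlockCompressor`): each compressor
/// reports the alignment it needs for the start of every chunk.
pub trait ChunkAlignment {
    /// Required alignment in bytes; must be a power of two.
    fn chunk_alignment(&self) -> usize {
        8 // assumed default, matching the current 8-byte chunk alignment
    }
}

/// Hypothetical compressor that is happy with the default alignment.
pub struct PlainBinary;
impl ChunkAlignment for PlainBinary {}

/// Hypothetical SIMD compressor that asks for stricter alignment.
pub struct SimdBitpack;
impl ChunkAlignment for SimdBitpack {
    fn chunk_alignment(&self) -> usize {
        64
    }
}

/// Bytes of padding needed after `len` bytes so the next chunk starts aligned.
pub fn padding_for(len: usize, alignment: usize) -> usize {
    debug_assert!(alignment.is_power_of_two());
    (alignment - (len & (alignment - 1))) & (alignment - 1)
}

/// Append `chunk` to `out`, then zero-pad to the compressor's alignment.
pub fn write_chunk(out: &mut Vec<u8>, chunk: &[u8], compressor: &dyn ChunkAlignment) {
    out.extend_from_slice(chunk);
    let pad = padding_for(out.len(), compressor.chunk_alignment());
    out.resize(out.len() + pad, 0);
}
```

With this shape, only compressors that override `chunk_alignment` pay for the extra padding.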

westonpace commented 1 week ago

> (up to 6% of the block is wasted space)

I did the math wrong. 64/4096 is 1/64th so more like 1-2%.
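The corrected estimate can be checked with a small calculation. The sketch assumes that padding to an alignment of `alignment` bytes wastes at most `alignment - 1` bytes per chunk; the function name is hypothetical.

```rust
/// Hypothetical helper: worst-case fraction of a chunk spent on padding,
/// assuming at most `alignment - 1` padding bytes per chunk.
pub fn worst_case_padding_fraction(chunk_size: usize, alignment: usize) -> f64 {
    (alignment - 1) as f64 / chunk_size as f64
}
```

For a 4096-byte chunk and 64-byte alignment this gives 63/4096, roughly 1.5%, which is in line with the 1-2% figure.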

broccoliSpicy commented 3 days ago

The `.to_vec` call at https://github.com/lancedb/lance/blob/c237bcb9318d30cf382aecd56b673aae85b2c555/rust/lance-encoding/src/encodings/physical/bitpack_fastlanes.rs#L1727 makes a copy of the data, but that copy is not guaranteed to start at a 64-byte-aligned position either. We may be able to get rid of this copy with the current page layout padding; if so, the bitpack mini-block compression logic will need changes.
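The alignment point can be made concrete with a sketch. A `Vec<u8>` only promises the alignment of `u8`, so a decoder that really needs 64-byte alignment would have to either check the incoming buffer or copy into storage whose element type carries that alignment. `Block64`, `AlignedBuf`, and `is_aligned` below are hypothetical names and are not taken from lance.

```rust
/// Hypothetical 64-byte block; `repr(align(64))` forces the alignment.
#[repr(C, align(64))]
#[derive(Clone, Copy)]
struct Block64([u8; 64]);

/// Hypothetical byte buffer whose start is always 64-byte aligned.
pub struct AlignedBuf {
    blocks: Vec<Block64>,
    len: usize,
}

impl AlignedBuf {
    /// Copy `src` into 64-byte-aligned storage, zero-filling the tail block.
    pub fn copy_from(src: &[u8]) -> Self {
        let n_blocks = (src.len() + 63) / 64;
        let mut blocks = vec![Block64([0u8; 64]); n_blocks];
        for (block, chunk) in blocks.iter_mut().zip(src.chunks(64)) {
            block.0[..chunk.len()].copy_from_slice(chunk);
        }
        Self { blocks, len: src.len() }
    }

    /// View the stored bytes; the slice starts on a 64-byte boundary.
    pub fn as_bytes(&self) -> &[u8] {
        // SAFETY: `Block64` is exactly 64 bytes with no padding, so the
        // blocks form one contiguous byte range of at least `len` bytes.
        unsafe { std::slice::from_raw_parts(self.blocks.as_ptr() as *const u8, self.len) }
    }
}

/// Check whether a byte slice starts at an address aligned to `alignment`.
pub fn is_aligned(bytes: &[u8], alignment: usize) -> bool {
    bytes.as_ptr() as usize % alignment == 0
}
```

If the page layout padding already guarantees that `is_aligned(chunk, 64)` holds for the bytes read from disk, the copy could be skipped entirely instead of being replaced by an aligned one.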