lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

compression in binary miniblock #3153

Open broccoliSpicy opened 9 hours ago

broccoliSpicy commented 9 hours ago

The byte values don't offer much opportunity for encodings like bitpacking. Even for pure ASCII, where each byte uses only 7 bits, packing turns 8 bytes into 7 bytes at best.
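To make the 8 bytes -> 7 bytes figure concrete, here is a minimal sketch of 7-bit packing for ASCII input. The function names `pack_ascii7` and `unpack_ascii7` are hypothetical and are not part of Lance; the bit layout (least significant bits first) is just one possible choice.

```rust
/// Pack 7-bit ASCII bytes tightly, least significant bits first.
/// Returns None if any byte has its high bit set (not pure ASCII).
/// Hypothetical helper for illustration, not a Lance API.
fn pack_ascii7(input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity((input.len() * 7 + 7) / 8);
    let mut acc: u16 = 0; // bit accumulator
    let mut nbits: u32 = 0; // number of valid bits currently in `acc`
    for &b in input {
        if b >= 0x80 {
            return None;
        }
        acc |= u16::from(b) << nbits;
        nbits += 7;
        if nbits >= 8 {
            // A full output byte is ready: emit it and keep the remainder.
            out.push((acc & 0xFF) as u8);
            acc >>= 8;
            nbits -= 8;
        }
    }
    if nbits > 0 {
        // Flush the final partial byte (upper bits are zero padding).
        out.push(acc as u8);
    }
    Some(out)
}

/// Inverse of `pack_ascii7`. `count` is the number of original bytes,
/// which must be stored separately because of the trailing padding.
fn unpack_ascii7(packed: &[u8], count: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(count);
    let mut acc: u16 = 0;
    let mut nbits: u32 = 0;
    let mut bytes = packed.iter();
    while out.len() < count {
        if nbits < 7 {
            match bytes.next() {
                Some(&b) => {
                    acc |= u16::from(b) << nbits;
                    nbits += 8;
                }
                None => break, // truncated input: return what was decoded
            }
        }
        out.push((acc & 0x7F) as u8);
        acc >>= 7;
        nbits -= 7;
    }
    out
}
```

The best case is a 12.5% reduction, it only applies when every byte is ASCII, and decoding needs the original byte count on the side.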

For the offsets, store lengths rather than offsets. Since the max length in a binary miniblock is less than 255, each length can be stored as a u8 instead of a u32.
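A minimal sketch of the idea, assuming Arrow-style offsets (n + 1 `u32` values for n binary values). The helper names `offsets_to_lengths` and `lengths_to_offsets` are hypothetical and do not reflect how the Lance encoder is actually structured.

```rust
/// Convert n + 1 offsets into n one-byte lengths.
/// Returns None if the offsets decrease or if any length does not fit in a u8.
/// Hypothetical helper for illustration, not a Lance API.
fn offsets_to_lengths(offsets: &[u32]) -> Option<Vec<u8>> {
    offsets
        .windows(2)
        .map(|w| {
            w[1].checked_sub(w[0])
                .filter(|&len| len <= u32::from(u8::MAX))
                .map(|len| len as u8)
        })
        .collect()
}

/// Rebuild the offsets from the lengths with a running sum starting at `base`.
/// Assumes the total size fits in a u32, as it did for the original offsets.
fn lengths_to_offsets(base: u32, lengths: &[u8]) -> Vec<u32> {
    let mut offsets = Vec::with_capacity(lengths.len() + 1);
    let mut pos = base;
    offsets.push(pos);
    for &len in lengths {
        pos += u32::from(len);
        offsets.push(pos);
    }
    offsets
}
```

The trade-off is that finding the start of value i now needs a prefix sum over the preceding lengths instead of a direct lookup. That cost is bounded when it is paid per miniblock rather than over the whole column.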

Estimated storage savings: with an average binary length of 100 bytes, the u32 offsets are around 4% of the size of the values, so u8 lengths save around 3%. With an average binary length of 20 bytes, the offsets are around 20% of the size of the values, so the saving is around 15%.
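A rough check of these figures, assuming one 4-byte `u32` offset per value today and one 1-byte `u8` length per value after the change, and ignoring the extra trailing offset and any padding. The function names are hypothetical. The percentage depends on the denominator, so the sketch computes the saving both relative to the value bytes alone and relative to values plus offsets.

```rust
const OFFSET_BYTES: f64 = 4.0; // u32 offset per value (assumed)
const LENGTH_BYTES: f64 = 1.0; // u8 length per value (assumed)

/// Bytes saved per value, as a fraction of the value bytes alone.
fn saving_vs_values(avg_len: f64) -> f64 {
    (OFFSET_BYTES - LENGTH_BYTES) / avg_len
}

/// Bytes saved per value, as a fraction of values plus u32 offsets.
fn saving_vs_total(avg_len: f64) -> f64 {
    (OFFSET_BYTES - LENGTH_BYTES) / (avg_len + OFFSET_BYTES)
}
```

For an average length of 100 the two measures are 3/100 and 3/104, which both round to about 3%. For an average length of 20 they are 3/20 = 15% and 3/24 = 12.5%, so the benefit is concentrated in short values either way.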