Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming..
the bytes value doesn't have much opportunity for encodings like bitpack.
pure ASCII, 8 bytes -> 7 bytes.
for the offsets, store the length rather than offsets. Since the max length in binary miniblock is less than 255, store it use a u8 instead of a u32.
storage we can save, say average binary length is 100, around 4% of data are offsets, around 3% saving.
say average binary length 20, around 20% of data are offsets, around 15% saving.
the bytes value doesn't have much opportunity for encodings like bitpack. pure ASCII, 8 bytes -> 7 bytes.
for the offsets, store the length rather than offsets. Since the max length in binary miniblock is less than 255, store it use a u8 instead of a u32.
storage we can save, say average binary length is 100, around 4% of data are offsets, around 3% saving. say average binary length 20, around 20% of data are offsets, around 15% saving.