Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
The repetition index is a general-purpose structure that is used in the following situations:
- Miniblock encoding when there are repetition levels (in this case the index stores chunk offsets)
- Zipped encoding when there are repetition levels OR when the data is variable-width (in this case the index stores byte offsets)
- Either miniblock or zipped encoding when there is RLE (this will be added much later)
The repetition index is not read during full scans, but it is read during a partial scan of a page. This means the repetition index introduces "indirect I/O" back into the 2.1 format ( :melting_face: ).
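To make the "indirect I/O" point concrete, here is a minimal Python sketch of the byte-offset case (variable-width data in a zipped-style layout). It is not Lance's actual on-disk layout or API: `build_page`, `read_rows`, and the one-u64-per-offset index encoding are all hypothetical and chosen only for illustration. A partial scan first reads the slice of the index that bounds the requested rows, and only then knows which byte range of the data to read.

```python
import struct

def build_page(rows):
    """Pack variable-width rows back to back and record each row's byte offset."""
    data = b"".join(rows)
    offsets = [0]
    for row in rows:
        offsets.append(offsets[-1] + len(row))
    # Hypothetical index encoding: one little-endian u64 per offset (N + 1 entries).
    index = struct.pack(f"<{len(offsets)}Q", *offsets)
    return data, index

def read_rows(data, index, start, stop):
    """Partial scan of rows [start, stop): one index read, then one data read."""
    # Read 1: fetch only the index entries that bound the requested rows.
    entries = index[start * 8:(stop + 1) * 8]
    offs = struct.unpack(f"<{stop - start + 1}Q", entries)
    # Read 2: fetch the byte range that the index pointed at.
    base = offs[0]
    chunk = data[base:offs[-1]]
    return [chunk[a - base:b - base] for a, b in zip(offs, offs[1:])]

rows = [b"a", b"bcd", b"", b"ef", b"ghij"]
data, index = build_page(rows)
result = read_rows(data, index, 1, 4)  # rows 1, 2 and 3
```

The second read cannot be issued until the first one returns, which is the dependency that "indirect I/O" refers to. A full scan in this sketch would just read `data` end to end and never touch `index`.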