Modern columnar data format for ML and LLMs, implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, a vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
For example, I have lots of images that are resized JPGs and GIFs -- the hashes are technically different, but the L2 distance between their vectors is tiny
then there's deduplicating things like images and their watermarked versions -- I also want these grouped together so I can pick just one
which basically uses the vector distance again, but with a looser threshold
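A rough sketch of that idea, in plain Python so it runs without any dependencies. Everything here is an assumption for illustration: the function names, the toy 2-D embeddings, and both threshold values are made up, and the greedy "compare against the first member of each group" strategy is just one simple way to group near-duplicates. It is not taken from any library's API.

```python
import math
from typing import List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_near_duplicates(
    vectors: List[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Greedy grouping: each vector joins the first group whose
    representative (the group's first member) is within `threshold`;
    otherwise it starts a new group. Returns groups of indices."""
    groups: List[List[int]] = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if l2_distance(vec, vectors[group[0]]) <= threshold:
                group.append(i)
                break
        else:
            # No existing group was close enough.
            groups.append([i])
    return groups


def pick_representatives(groups: List[List[int]]) -> List[int]:
    """Keep just one item per group (here, the first one seen)."""
    return [group[0] for group in groups]


# Hypothetical embeddings; real ones would come from an image model.
embeddings = [
    [0.00, 0.00],  # 0: original image
    [0.01, 0.00],  # 1: resized copy of image 0
    [0.30, 0.00],  # 2: watermarked version of image 0
    [5.00, 5.00],  # 3: unrelated image
]

# Tight threshold: only the resized copy collapses into the original.
tight = group_near_duplicates(embeddings, threshold=0.05)

# Looser threshold: the watermarked version is grouped in as well.
loose = group_near_duplicates(embeddings, threshold=0.5)

print(tight, pick_representatives(tight))
print(loose, pick_representatives(loose))
```

The linear scan over groups is fine for a small collection; at larger scale, a vector index would presumably take the place of the inner loop, and the two thresholds would need tuning against real embeddings.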
Problem Statement