lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Allow users to set a distance threshold to consider two vectors "very similar" #875

Open eddyxu opened 1 year ago

eddyxu commented 1 year ago

Problem Statement

For example, I have lots of images that are resized JPGs and GIFs. The hashes are technically different, but the vector L2 distance is tiny. Then there's deduplicating things like images and their watermarked versions. I also want these to be grouped together so I can pick just one, which basically uses the vector distance again, but with a looser threshold.
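To make the request concrete, here is a minimal sketch of threshold-based grouping in plain Python. It is not Lance's API: the function names `l2_distance` and `group_near_duplicates`, the toy 2-D vectors, and the two threshold values are all hypothetical, and the brute-force pairwise comparison stands in for whatever index-backed search a real implementation would use.

```python
import math
from typing import Dict, List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_near_duplicates(
    vectors: Sequence[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Group row indices whose vectors are linked by pairs within `threshold`.

    Hypothetical helper: brute-force O(n^2) comparison plus union-find.
    Groups are transitive (single linkage), so a chain of close pairs
    ends up in one group.
    """
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if l2_distance(vectors[i], vectors[j]) <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so the first row
                    # of each group acts as its representative.
                    parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Toy embeddings (made up): rows 0 and 1 mimic a resized copy,
# row 2 mimics a watermarked variant, row 3 is an unrelated image.
embeddings = [[0.0, 0.0], [0.01, 0.0], [0.3, 0.0], [5.0, 5.0]]

strict = group_near_duplicates(embeddings, threshold=0.05)
loose = group_near_duplicates(embeddings, threshold=0.5)
representatives = [group[0] for group in loose]
```

With the tight threshold only the resized copy is merged with its original; with the looser one the watermarked variant joins the same group, and `representatives` keeps one row per group. Because the grouping is transitive, a long chain of near neighbours can merge items that are far apart from each other, which is one of the trade-offs a proof of concept would need to evaluate.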

changhiskhan commented 1 year ago

Help wanted: if anyone can do a simple POC that determines what the optimal technique is, we can implement the rest in Rust.