lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming..
https://lancedb.github.io/lance/
Apache License 2.0
3.93k stars 215 forks source link

Implement vector clustering based on BFS HNSW order #2064

Open wjones127 opened 7 months ago

wjones127 commented 7 months ago

We would like to increase the chance that similar vectors are nearby in data files. One idea is to order first by IVF partition, and then again using depth-first-search order of HNSW index.

We should do benchmarking to validate this meaningfully improves query performance.

broccoliSpicy commented 3 months ago

can I work on this issue? @westonpace, @wjones127, @eddyxu, I may need some support

westonpace commented 3 months ago

@broccoliSpicy as far as I know, there is no one actively working on this. However, this will be a big issue. The table format currently has no concept of "clustering key".

I expect it will take months of work. There are other issues that may be more approachable. For example, I just created https://github.com/lancedb/lance/issues/2612 which is similar in topic but much smaller in scope. If sticking to this issue then, as a starting point, we are probably going to want some kind of clear design document describing the approach to take.

wjones127 commented 3 months ago

I also am not sure we will ever do this. We are looking at implementing incrementing primary keys (https://github.com/lancedb/lance/issues/2454). At which point, we'll generally cluster tables by that key. I think only after that should we start researching whether this is worth it. This is because clustering by something other than the primary key will cause secondary indices to become slower.

broccoliSpicy commented 3 months ago

Thanks for the feedback @westonpace @wjones127 ! I will try #2612 first.