Implement vector clustering based on BFS HNSW order

wjones127 commented 7 months ago

We would like to increase the chance that similar vectors are stored near each other in data files. One idea is to order rows first by IVF partition, and then by the depth-first-search order of the HNSW index.

We should benchmark this to validate that it meaningfully improves query performance.
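
To make the proposed ordering concrete, here is a minimal sketch in plain Python. It is not Lance code. It assumes, purely for illustration, that each IVF partition has its own HNSW graph available as an adjacency dict with one entry point per partition. The names `dfs_order`, `cluster_order`, `partition_of`, `neighbors`, and `entry_points` are all hypothetical.

```python
def dfs_order(neighbors, entry):
    """Return node ids in depth-first preorder, starting from `entry`."""
    order, seen, stack = [], set(), [entry]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push in reverse so the first-listed neighbor is visited first.
        stack.extend(reversed(neighbors.get(node, [])))
    return order


def cluster_order(partition_of, neighbors, entry_points):
    """Sort row ids by (IVF partition id, DFS rank within that partition).

    partition_of: row id -> partition id (hypothetical input)
    neighbors:    row id -> list of neighbor row ids (assumed HNSW edges)
    entry_points: partition id -> entry row id of that partition's graph
    """
    rank = {}
    for part, entry in entry_points.items():
        for i, node in enumerate(dfs_order(neighbors, entry)):
            if partition_of.get(node) == part:
                rank.setdefault(node, i)
    # Rows the traversal never reached sort to the end of their partition.
    unreached = float("inf")
    return sorted(
        partition_of,
        key=lambda row: (partition_of[row], rank.get(row, unreached), row),
    )


partition_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
neighbors = {0: [2, 1], 2: [1], 1: [0], 3: [4], 4: [3]}
entry_points = {0: 0, 1: 3}
print(cluster_order(partition_of, neighbors, entry_points))  # [0, 2, 1, 3, 4]
```

Rows 0, 2, and 1 stay together because they share partition 0, and within it they follow the traversal order rather than the row id order. Whether this layout actually speeds up queries is exactly what the benchmarking would need to show.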

broccoliSpicy commented 3 months ago

Can I work on this issue? @westonpace, @wjones127, @eddyxu, I may need some support.

westonpace commented 3 months ago

@broccoliSpicy as far as I know, no one is actively working on this. However, this will be a big undertaking. The table format currently has no concept of a "clustering key".

- We will need to store, in the metadata, which key we plan to use to cluster the data.
- When new data arrives, we will probably add it to the delta/WAL/un-indexed section in the wrong order. Then, at compaction, we would need some way of inserting the data in the correct position. I honestly have no idea what this process looks like (though @wjones127 has been doing more research here).
- If the user changes the cluster key, we would need to rewrite the entire dataset (maybe this wouldn't even be supported).
- The cluster key, in this case, is not just "cluster on column X" but "cluster on column X using this complex sorting algorithm". For example, if a user retrains their IVF index, then the partitions will be destroyed and recreated, and all of the data would need to be rewritten.
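
The points above can be sketched as follows. This is a rough illustration under stated assumptions, not an existing Lance API or a proposed design: `ClusterKey`, `needs_full_rewrite`, and `compact` are hypothetical names, and the merge shown is only one naive way compaction could place new rows into an already ordered base.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterKey:
    """Hypothetical metadata record describing how a table is clustered."""
    column: str         # column the ordering is derived from
    method: str         # label for the sorting algorithm (made up here)
    index_version: int  # version of the index the ordering was computed from


def needs_full_rewrite(stored, current):
    """Any change to the key, including a retrained index, invalidates the order."""
    return stored is not None and stored != current


def compact(base, delta, sort_key):
    """Merge unordered delta rows into a base that is already ordered by sort_key."""
    return list(heapq.merge(base, sorted(delta, key=sort_key), key=sort_key))


stored = ClusterKey("vector", "ivf_then_hnsw_dfs", index_version=1)
retrained = ClusterKey("vector", "ivf_then_hnsw_dfs", index_version=2)
print(needs_full_rewrite(stored, retrained))  # True
print(compact([1, 4, 9], [7, 2], sort_key=lambda row: row))  # [1, 2, 4, 7, 9]
```

Tracking the index version inside the key captures the retraining case: the column and algorithm are unchanged, but the ordering they produce is not, so the stored key no longer matches.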

I expect it will take months of work. There are other issues that may be more approachable. For example, I just created https://github.com/lancedb/lance/issues/2612, which is similar in topic but much smaller in scope. If you stick with this issue, then as a starting point we will probably want a clear design document describing the approach to take.

wjones127 commented 3 months ago

I am also not sure we will ever do this. We are looking at implementing incrementing primary keys (https://github.com/lancedb/lance/issues/2454), at which point we'll generally cluster tables by that key. I think only after that should we start researching whether this is worth it, because clustering by something other than the primary key will cause secondary indices to become slower.

broccoliSpicy commented 3 months ago

Thanks for the feedback, @westonpace @wjones127! I will try #2612 first.

Implement vector clustering based on BFS HNSW order #2064