Open wjones127 opened 7 months ago
can I work on this issue? @westonpace, @wjones127, @eddyxu, I may need some support
@broccoliSpicy as far as I know, there is no one actively working on this. However, this will be a big issue. The table format currently has no concept of "clustering key".
I expect it will take months of work. There are other issues that may be more approachable. For example, I just created https://github.com/lancedb/lance/issues/2612 which is similar in topic but much smaller in scope. If sticking to this issue then, as a starting point, we are probably going to want some kind of clear design document describing the approach to take.
I also am not sure we will ever do this. We are looking at implementing incrementing primary keys (https://github.com/lancedb/lance/issues/2454). At which point, we'll generally cluster tables by that key. I think only after that should we start researching whether this is worth it. This is because clustering by something other than the primary key will cause secondary indices to become slower.
Thanks for the feedback @westonpace @wjones127 ! I will try #2612 first.
We would like to increase the chance that similar vectors are nearby in data files. One idea is to order first by IVF partition, and then again using depth-first-search order of HNSW index.
We should do benchmarking to validate this meaningfully improves query performance.