Open westonpace opened 1 month ago
@broccoliSpicy some guidance if you'd like to tackle this.
IVF_PQ indices are currently created in rust/lance/src/index/vector/ivf/builder.rs
in the build_partitions
method. In this method we:
data
as input which is a stream of all the vectors in the datasetivf_transformer
which will compute centroid residuals, quantize the vectors, etc.shuffle_dataset
Then in shuffle_dataset
we:
So the actual change needed by this issue would happen in that final step (write out the final sorted file). This actually happens in write_pq_partitions
which is in rust/lance/src/index/vector/ivf/io.rs
.
@westonpace Thank you so much for the guidance!
When we store PQ data (e.g. in an IVF_PQ index) I do not believe it is sorted by row ID. However, there is no downside to doing so, we just need to sort before we write during the shuffling process.
If we can do that, and guarantee that the PQ data is sorted by row ID, then we can speed up prefiltering. The slowest part of prefiltering currently is not the secondary index search to get the matching row IDs. The slowest part is actually applying that prefilter to the PQ search. This is because we have to sort the PQ data by row ID and then filter it. If the PQ data is already sorted we can skip that step and speed up prefiltering.