PQ storage should be sorted by row ID

westonpace commented 1 month ago

When we store PQ data (e.g. in an IVF_PQ index) I do not believe it is sorted by row ID. However, there is no downside to doing so, we just need to sort before we write during the shuffling process.

If we can do that, and guarantee that the PQ data is sorted by row ID, then we can speed up prefiltering. The slowest part of prefiltering currently is not the secondary index search to get the matching row IDs. The slowest part is actually applying that prefilter to the PQ search. This is because we have to sort the PQ data by row ID and then filter it. If the PQ data is already sorted we can skip that step and speed up prefiltering.

westonpace commented 1 month ago

@broccoliSpicy some guidance if you'd like to tackle this.

IVF_PQ indices are currently created in rust/lance/src/index/vector/ivf/builder.rs in the build_partitions method. In this method we:

Take data as input which is a stream of all the vectors in the dataset
Create an ivf_transformer which will compute centroid residuals, quantize the vectors, etc.
Call shuffle_dataset

Then in shuffle_dataset we:

Scan the transformed data and write it to a file (the unsorted file)
Partition the unsorted file into some number of jobs
For each job, we read that partition, then convert it into a Vec<Vec> (outer vec has one entry per partition. Inner vec contains all batches for that partition)
Write out a sorted file for each job
Scan the jobs
Write out the final sorted file

So the actual change needed by this issue would happen in that final step (write out the final sorted file). This actually happens in write_pq_partitions which is in rust/lance/src/index/vector/ivf/io.rs.

broccoliSpicy commented 1 month ago

@westonpace Thank you so much for the guidance!

lancedb / lance

PQ storage should be sorted by row ID #2612