lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming..
https://lancedb.github.io/lance/
Apache License 2.0
3.76k stars 206 forks source link

PQ storage should be sorted by row ID #2612

Open westonpace opened 1 month ago

westonpace commented 1 month ago

When we store PQ data (e.g. in an IVF_PQ index) I do not believe it is sorted by row ID. However, there is no downside to doing so, we just need to sort before we write during the shuffling process.

If we can do that, and guarantee that the PQ data is sorted by row ID, then we can speed up prefiltering. The slowest part of prefiltering currently is not the secondary index search to get the matching row IDs. The slowest part is actually applying that prefilter to the PQ search. This is because we have to sort the PQ data by row ID and then filter it. If the PQ data is already sorted we can skip that step and speed up prefiltering.

westonpace commented 1 month ago

@broccoliSpicy some guidance if you'd like to tackle this.

IVF_PQ indices are currently created in rust/lance/src/index/vector/ivf/builder.rs in the build_partitions method. In this method we:

Then in shuffle_dataset we:

So the actual change needed by this issue would happen in that final step (write out the final sorted file). This actually happens in write_pq_partitions which is in rust/lance/src/index/vector/ivf/io.rs.

broccoliSpicy commented 1 month ago

@westonpace Thank you so much for the guidance!