lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

PQ assignment stage `HashMap` lookups are too slow at 1B rows #2518

Open westonpace opened 2 weeks ago

westonpace commented 2 weeks ago

We don't need a `HashMap` anyway. Since we're mapping from row address to partition id, we can use a `Vec<Vec<...>>` where each lookup is `map[fragment_id][row_offset]`. From some experimentation, this is ~6x faster.
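A minimal sketch of that idea, not the code from the linked PR. It assumes a row address is a `u64` with the fragment id in the upper 32 bits and the row offset in the lower 32 bits, that fragment ids are small and dense enough to index a `Vec` directly, and that partition ids fit in a `u32`. The names `PartitionLookup`, `make_row_addr`, `split_row_addr`, and `UNASSIGNED` are hypothetical and exist only for illustration.

```rust
/// Sentinel for rows that have not been assigned a partition yet (assumption:
/// real partition ids never reach u32::MAX).
const UNASSIGNED: u32 = u32::MAX;

/// Hypothetical dense lookup table from row address to partition id.
pub struct PartitionLookup {
    // partitions[fragment_id][row_offset] holds the partition id for that row.
    partitions: Vec<Vec<u32>>,
}

impl PartitionLookup {
    /// `rows_per_fragment[i]` is the number of rows in fragment `i`.
    pub fn new(rows_per_fragment: &[usize]) -> Self {
        let partitions = rows_per_fragment
            .iter()
            .map(|&n| vec![UNASSIGNED; n])
            .collect();
        Self { partitions }
    }

    /// Record the partition for a row. Panics if the address is out of range.
    pub fn set(&mut self, row_addr: u64, partition_id: u32) {
        let (frag, offset) = split_row_addr(row_addr);
        self.partitions[frag][offset] = partition_id;
    }

    /// Two bounds-checked index operations, with no hashing and no probing.
    pub fn get(&self, row_addr: u64) -> Option<u32> {
        let (frag, offset) = split_row_addr(row_addr);
        self.partitions
            .get(frag)?
            .get(offset)
            .copied()
            .filter(|&p| p != UNASSIGNED)
    }
}

/// Assumed layout: upper 32 bits are the fragment id, lower 32 bits the offset.
fn split_row_addr(addr: u64) -> (usize, usize) {
    ((addr >> 32) as usize, (addr & 0xFFFF_FFFF) as usize)
}

/// Build a row address under the same assumed layout.
pub fn make_row_addr(fragment_id: u32, row_offset: u32) -> u64 {
    ((fragment_id as u64) << 32) | (row_offset as u64)
}
```

Usage would look like `lookup.set(make_row_addr(0, 2), 7)` followed by `lookup.get(make_row_addr(0, 2))`. The trade-off is memory: every row in every fragment gets a slot whether or not it is assigned, so this only pays off when most rows are present, which should be the case when assigning partitions for a whole dataset.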

westonpace commented 2 weeks ago

https://github.com/lancedb/lance/pull/2492/commits/860d0028bf3d60d826270ecb682f8c5d48ca81d4 demonstrates the fix