lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.78k stars 207 forks

Shuffling stage is too slow #2519

Open westonpace opened 2 months ago

westonpace commented 2 months ago

When training an index, the shuffling stage runs too slowly. This seems to be because we are storing 1 batch per partition (even empty partitions need a batch) and there is a lot of concatenation. By storing a list array we can avoid both problems, and the shuffling can actually complete quite quickly (in theory as fast as 6 minutes for 1B rows, though it might need many CPUs and a lot of RAM to achieve that). At the very least we should be able to finish in 1-2 hours.
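To illustrate the list-array idea, here is a minimal pure-Python sketch (not Lance's actual implementation). It assumes each row already has a partition id and groups row indices into an `offsets` + `values` pair, the same shape a list array uses. The names `group_rows_by_partition` and `partition_rows` are hypothetical and used only for illustration.

```python
from itertools import accumulate


def group_rows_by_partition(partition_ids, num_partitions):
    """Group row indices by partition into a list-array-style (offsets, values) pair.

    Hypothetical helper for illustration; not part of the Lance API.
    """
    # Pass 1: count how many rows fall into each partition.
    counts = [0] * num_partitions
    for pid in partition_ids:
        counts[pid] += 1

    # offsets[p] .. offsets[p + 1] delimits partition p inside `values`.
    # An empty partition just repeats the previous offset; it needs no batch.
    offsets = [0] + list(accumulate(counts))

    # Pass 2: scatter each row index directly into its final slot,
    # so no per-partition buffers are ever concatenated.
    cursor = offsets[:-1].copy()
    values = [0] * len(partition_ids)
    for row, pid in enumerate(partition_ids):
        values[cursor[pid]] = row
        cursor[pid] += 1
    return offsets, values


def partition_rows(offsets, values, pid):
    """Return the row indices that belong to partition `pid`."""
    return values[offsets[pid]:offsets[pid + 1]]


# Six rows spread over five partitions; partitions 1 and 4 are empty.
partition_ids = [2, 0, 2, 3, 0, 2]
offsets, values = group_rows_by_partition(partition_ids, num_partitions=5)
```

In this layout an empty partition costs one extra offset entry rather than a whole batch, and every row is written exactly once, which is the property the list-array approach relies on to avoid both problems.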

westonpace commented 2 months ago

https://github.com/lancedb/lance/pull/2492/commits/5709bcd88bc4ed7a62fba355d85bd320785b20ed describes a potential fix

westonpace commented 2 months ago

Also need https://github.com/lancedb/lance/pull/2492/commits/c59478be12953fbdbc358ffe9206f0d58c12ced4 and https://github.com/lancedb/lance/pull/2492/commits/7f866895f2901e6093c0000de957fec41d12caa3 (but this last one could use some tweaking; I think it creates too many sorted partition files)