Open westonpace opened 2 months ago
https://github.com/lancedb/lance/pull/2492/commits/5709bcd88bc4ed7a62fba355d85bd320785b20ed describes a potential fix
Also need https://github.com/lancedb/lance/pull/2492/commits/c59478be12953fbdbc358ffe9206f0d58c12ced4 and https://github.com/lancedb/lance/pull/2492/commits/7f866895f2901e6093c0000de957fec41d12caa3 (but this last one could use some tweaking, I think it creates too many sorted partition files)
When training an index the shuffling stage runs too slowly. This seems to be because we are storing 1 batch per partition (even empty partitions need a batch) and there is a lot of concatenation. By storing a list array we can avoid both problems and the shuffling can actually completely quite quickly (in theory as fast as 6 minutes for 1B rows but might need many CPUs / RAM to achieve). At the very least we should be able to finish in 1-2 hours.