lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Use `max_bytes_per_file` in compaction planning #3139

Open oceanusxiv opened 3 days ago

oceanusxiv commented 3 days ago

In #2728 I think it was stated that there might be plans to use this parameter in compaction planning. I couldn't find that tracked anywhere here, so I'm filing this issue, since it would be a very nice enhancement for me.

Compaction is almost entirely IO-bound, so for tables with large row sizes, using this parameter during planning can result in substantially higher throughput. As it stands, the planner is unlikely to schedule more than one compaction task when rows are large but the total row count is relatively small.
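To illustrate the effect described above, here is a minimal sketch (not Lance's actual planner; the `Fragment` stats, function names, and thresholds are all hypothetical) contrasting row-count-based task planning with a byte-capped variant. With wide rows, a row-count target sees "few rows" and emits a single task, while a `max_bytes_per_file`-style cap splits the same data into several tasks that can run in parallel:

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    num_rows: int
    num_bytes: int  # physical size; large when rows are wide


def plan_by_rows(fragments, target_rows_per_task):
    """Hypothetical planner: group fragments until a task reaches the row target."""
    tasks, current, rows = [], [], 0
    for frag in fragments:
        current.append(frag)
        rows += frag.num_rows
        if rows >= target_rows_per_task:
            tasks.append(current)
            current, rows = [], 0
    if current:
        tasks.append(current)
    return tasks


def plan_by_bytes(fragments, max_bytes_per_task):
    """Hypothetical planner: cap each task by bytes instead of rows."""
    tasks, current, size = [], [], 0
    for frag in fragments:
        if current and size + frag.num_bytes > max_bytes_per_task:
            tasks.append(current)
            current, size = [], 0
        current.append(frag)
        size += frag.num_bytes
    if current:
        tasks.append(current)
    return tasks


# 8 fragments of 10k rows each, but each row is ~100 KiB wide.
frags = [Fragment(num_rows=10_000, num_bytes=10_000 * 100 * 1024)
         for _ in range(8)]

# Row-based planning sees only 80k rows total -> a single task.
print(len(plan_by_rows(frags, target_rows_per_task=1_000_000)))   # 1

# A ~2 GiB byte cap splits the same data into four tasks.
print(len(plan_by_bytes(frags, max_bytes_per_task=2 * 1024**3)))  # 4
```

With one task, a single worker rewrites ~8 GiB of data serially; with four tasks, the IO-bound rewrites can proceed concurrently, which is where the throughput win comes from.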