refactor: remove the queue in LanceArrowWriter to reduce memory usage for spark sink

lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming.

https://lancedb.github.io/lance/

Apache License 2.0

3.97k stars 224 forks

refactor: remove the queue in LanceArrowWriter to reduce memory usage for spark sink #3110

Closed SaintBacchus closed 1 week ago

SaintBacchus commented 1 week ago

Remove the queue in LanceArrowWriter, since it may cache all rows and therefore require a lot of JVM memory.

Use a mutex to control the write rate of the sink. The writer waits until the reader takes the batch.
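To illustrate the idea, here is a minimal sketch of a single-slot handoff in which the writer blocks until the reader has taken the previous batch, so at most one batch is buffered at a time. This is not the actual LanceArrowWriter code: the class `SingleSlotHandoff` and its methods `put`, `take`, `finish`, and `demo` are hypothetical names, and a `ReentrantLock` with two conditions is only one assumed way to implement the mutex described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical single-slot handoff; an illustration, not the Lance implementation. */
final class SingleSlotHandoff<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition slotEmpty = lock.newCondition();
    private final Condition slotFull = lock.newCondition();
    private T slot;            // at most one pending batch is held in memory
    private boolean finished;  // set once the writer has no more batches

    /** Writer side: blocks until the reader has taken the previous batch. */
    void put(T batch) throws InterruptedException {
        lock.lock();
        try {
            while (slot != null) {
                slotEmpty.await();
            }
            slot = batch;
            slotFull.signal();
        } finally {
            lock.unlock();
        }
    }

    /** Writer side: signals end of input and wakes a waiting reader. */
    void finish() {
        lock.lock();
        try {
            finished = true;
            slotFull.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Reader side: blocks until a batch is available; returns null at end of input. */
    T take() throws InterruptedException {
        lock.lock();
        try {
            while (slot == null && !finished) {
                slotFull.await();
            }
            T batch = slot;
            slot = null;
            slotEmpty.signal();
            return batch;
        } finally {
            lock.unlock();
        }
    }

    /** Sends three integer "batches" from a writer to a reader thread and returns what was read. */
    static List<Integer> demo() {
        SingleSlotHandoff<Integer> handoff = new SingleSlotHandoff<>();
        List<Integer> received = new ArrayList<>();
        Thread reader = new Thread(() -> {
            try {
                Integer batch;
                while ((batch = handoff.take()) != null) {
                    received.add(batch);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();
        try {
            for (int i = 1; i <= 3; i++) {
                handoff.put(i);  // blocks while the previous batch is still untaken
            }
            handoff.finish();
            reader.join();       // join makes the reader's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }
}
```

Compared with an unbounded queue, this design trades writer throughput for a fixed memory bound: the writer can never run more than one batch ahead of the reader.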

In addition, I moved the maven-shade-plugin into a new profile that is disabled by default, because the jar-with-dependencies artifact conflicted with many jars in the Spark dependencies.