OOM from to_batches() on s3-stored dataset

lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.

https://lancedb.github.io/lance/

Apache License 2.0


OOM from to_batches() on s3-stored dataset #2960

Closed jacketsj closed 2 hours ago

jacketsj commented 3 hours ago

I have a very large dataset stored on S3 (>1 billion rows, 1024 dims), and I'm getting an OOM when running to_batches() on a machine with 64 GB of RAM. I'm just running:

for batch in tqdm(dataset.to_batches(batch_size=1024)):
    pass

There is some non-determinism in how many iterations it takes, fwiw. It OOMs around 4000~5000 iterations though (which, even if every batch were being materialized and kept, should only be about 16~20 GB of RAM).
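
For reference, the 16~20 GB figure can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: it assumes each vector element is a 4-byte float32 (the post does not state the dtype), and the helper name `batches_footprint_gib` is made up for illustration.

```python
def batches_footprint_gib(iterations, batch_size=1024, dims=1024, bytes_per_value=4):
    """Memory held if every batch read so far stayed resident, in GiB.

    bytes_per_value=4 assumes float32 vectors (an assumption, not stated in the post).
    """
    total_bytes = iterations * batch_size * dims * bytes_per_value
    return total_bytes / 2**30


# Worst case at the iteration counts where the OOM was observed.
for iterations in (4000, 5000):
    print(f"{iterations} iterations: {batches_footprint_gib(iterations):.2f} GiB")
```

Under that assumption each batch is 4 MiB, so even 5000 retained batches would stay well under 64 GB, which is why the OOM looked like something other than plain materialization.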

jacketsj commented 2 hours ago

Version error (on me).