Modern columnar data format for ML and LLMs, implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector indexing, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
I have a very large dataset stored on S3 (>1 billion rows, 1024 dims), and I'm getting an OOM from running to_batches() on a machine with 64 GB of RAM.
I'm just running:
```python
from tqdm import tqdm

# dataset is a Lance dataset opened from S3
for batch in tqdm(dataset.to_batches(batch_size=1024)):
    pass
```
There is some non-determinism in how many iterations it takes, fwiw. It OOMs around 4000~5000 iterations though (which, even if it were fully materializing the batches, should only be about 16~20 GB of RAM).
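For reference, here is the back-of-the-envelope arithmetic behind the 16~20 GB figure, assuming each row is a 1024-dim float32 vector (4 bytes per element — the dtype is my assumption, not stated above):

```python
# Estimate memory if every batch consumed so far were kept alive.
# Assumes 1024-dim float32 rows (4 bytes/element); dtype is an assumption.
def materialized_gib(iterations, batch_size=1024, dims=1024, bytes_per_elem=4):
    return iterations * batch_size * dims * bytes_per_elem / 2**30

print(materialized_gib(4000))  # 15.625 GiB
print(materialized_gib(5000))  # 19.53125 GiB
```

So even the worst case of retaining every batch should fit comfortably in 64 GB, which is why the OOM is surprising.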