lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in two lines of code for 100x faster random access, a vector index, and data versioning. Compatible with pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

batch_sizes > 1024 are not respected by dataset.to_batches(...) #1778

Closed · alexkohler closed this issue 10 months ago

alexkohler commented 10 months ago

This could be user error, but reading https://lancedb.github.io/lance/api/python/lance.html#lance.dataset.LanceDataset.to_batches, it doesn't look like there's any documented limit on batch_size. Behavior I'm seeing on lance 0.9.2:

import lance

# Path is illustrative; any existing Lance dataset shows the same behavior.
dataset = lance.dataset("/path/to/dataset.lance")

for batch in dataset.to_batches(columns=["foo"], batch_size=100):
    print(len(batch))  # prints 100

for batch in dataset.to_batches(columns=["foo"], batch_size=1000):
    print(len(batch))  # prints 1000

for batch in dataset.to_batches(columns=["foo"], batch_size=10000):
    print(len(batch))  # prints 1024, not 10000
westonpace commented 10 months ago

This is technically in alignment with pyarrow's dataset: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

The batch_size parameter is "the max size of batches returned" and only slices a batch when the batch stored in the file is larger than batch_size. Lance defaults to batches of 1024 rows.
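If the 1024 cap does come from how the data was stored, writing with a larger group size should raise it. A minimal, untested sketch (the path and values are illustrative; it assumes write_dataset's max_rows_per_group parameter, whose default is 1024, is what sets the underlying batch size, and that write_dataset returns the resulting dataset):

import lance
import pyarrow as pa

table = pa.table({"foo": pa.array(range(50_000))})

# Store the data in larger groups than the 1024-row default.
ds = lance.write_dataset(table, "/tmp/demo.lance", max_rows_per_group=8192)

for batch in ds.to_batches(columns=["foo"], batch_size=10_000):
    # Expected to be capped at 8192, the stored group size,
    # rather than the 10_000 requested.
    print(len(batch))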

It looks like we have it documented as "The number of rows to fetch per batch", which is misleading. So we should at least update our docs.
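In the meantime, batches of an exact target size can be produced by re-chunking on the pyarrow side. A sketch, reusing the dataset variable from the snippet above and assuming the selected columns fit in memory:

import pyarrow as pa

# Materialize the scan, merge the per-column chunks, then re-slice into
# batches of up to 10_000 rows (only the last batch may be smaller).
table = pa.Table.from_batches(dataset.to_batches(columns=["foo"]))
for batch in table.combine_chunks().to_batches(max_chunksize=10_000):
    print(len(batch))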

alexkohler commented 10 months ago

gotcha - thanks for the speedy reply! Opened a PR to update the docs.