lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in two lines of code for 100x faster random access, a vector index, and data versioning. Compatible with pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

batch_sizes > 1024 are not respected by dataset.to_batches(...) #1778

Closed · alexkohler closed this issue 10 months ago

alexkohler commented 10 months ago

This could be user error, but reading https://lancedb.github.io/lance/api/python/lance.html#lance.dataset.LanceDataset.to_batches, it doesn't look like there's any documented limit on batch_size. Behavior I'm seeing on lance 0.9.2:

import lance

# Path is illustrative; any existing Lance dataset shows the same behavior.
dataset = lance.dataset("/path/to/dataset.lance")

for batch in dataset.to_batches(columns=["foo"], batch_size=100):
    print(len(batch))  # prints 100

for batch in dataset.to_batches(columns=["foo"], batch_size=1000):
    print(len(batch))  # prints 1000

for batch in dataset.to_batches(columns=["foo"], batch_size=10000):
    print(len(batch))  # prints 1024, not 10000
westonpace commented 10 months ago

This is technically in alignment with pyarrow's dataset: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

The batch_size parameter is "the max size of batches returned" and only slices a batch when the batch stored in the file is larger than batch_size. Lance defaults to batches of 1024 rows.
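If the 1024 cap does come from how the data was stored, writing with a larger group size should raise it. A minimal, untested sketch (the path and values are illustrative; it assumes write_dataset's max_rows_per_group parameter, whose default is 1024, is what sets the underlying batch size, and that write_dataset returns the resulting dataset):

import lance
import pyarrow as pa

table = pa.table({"foo": pa.array(range(50_000))})

# Store the data in larger groups than the 1024-row default.
ds = lance.write_dataset(table, "/tmp/demo.lance", max_rows_per_group=8192)

for batch in ds.to_batches(columns=["foo"], batch_size=10_000):
    # Expected to be capped at 8192, the stored group size,
    # rather than the 10_000 requested.
    print(len(batch))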

It looks like we have it documented as "The number of rows to fetch per batch", which is misleading. So we should at least update our docs.
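In the meantime, batches of an exact target size can be produced by re-chunking on the pyarrow side. A sketch, reusing the dataset variable from the snippet above and assuming the selected columns fit in memory:

import pyarrow as pa

# Materialize the scan, merge the per-column chunks, then re-slice into
# batches of up to 10_000 rows (only the last batch may be smaller).
table = pa.Table.from_batches(dataset.to_batches(columns=["foo"]))
for batch in table.combine_chunks().to_batches(max_chunksize=10_000):
    print(len(batch))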

alexkohler commented 10 months ago

gotcha - thanks for the speedy reply! Opened a PR to update the docs.