Closed alexkohler closed 10 months ago
This is technically in alignment with pyarrow's dataset: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches
batch_size : int, default 131_072
The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
The `batch_size` parameter is "the max size of batches returned": it only slices a batch when the underlying batch in the file is very large. Lance defaults to batches of 1024 rows.
It looks like we have it documented as "The number of rows to fetch per batch.", which is misleading. So we should at least update our docs.
gotcha - thanks for the speedy reply! Opened a PR to update the docs.
This could be user error, but reading https://lancedb.github.io/lance/api/python/lance.html#lance.dataset.LanceDataset.to_batches, it doesn't look like any limit is enforced. Behavior I'm seeing on lance 0.9.2 below: