lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

`LanceFragment.to_batches` not respecting `filter` kwarg #2778

Open tonyf opened 2 months ago

tonyf commented 2 months ago

It looks like `LanceFragment.to_batches` isn't respecting the `filter` kwarg: rows that don't match the filter are still returned.

Repro

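import lance

# `path` points at an existing Lance dataset with `image` and `split` columns.
ds = lance.dataset(path)
fragments = ds.get_fragments()

# Take the first batch from the first fragment.
# With the filter applied, only rows where split == 'test' should come back.
for batch in fragments[0].to_batches(
    batch_size=1,
    filter="split == 'test'",
    columns=["image", "split"],
    with_row_id=True,
    batch_readahead=8,
):
    break

print(batch)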
>>>
pyarrow.RecordBatch
image: binary
split: string
----
image: ...
split: ["train"]

tonyf commented 2 months ago

This seems to happen only with the legacy storage format. With the stable format, it appears to work correctly at the moment.