During the investigation of https://github.com/lancedb/lancedb/issues/1539, it turns out that lance already support pyarrow.compute.Expression, however Polars by default set string column to pa.large_string, which prevents us to push arrow filter via Polars.

How to Preproduce

import pyarrow as pa
import lance

schema = pa.schema({"x": pa.int64(), "str": pa.large_string()})
table = pa.table(
    {"x": range(10), "str": [f"str-{i}" for i in range(10)]}, schema=schema
)

ds = lance.write_dataset(table, tmp_path)

print(pc.field("str").isin(["str-2", "str-3"]))
batch = list(ds.to_batches(filter=pc.field("str").isin(["str-2", "str-3"])))[0]
print(batch)

With error

    def to_reader(self) -> pa.RecordBatchReader:
>       return self._scanner.to_pyarrow()
E       ValueError: Invalid user input: pushdown filter referenced a field that is not yet supported by Substrait conversion, /Users/lei/work/lance/rust/lance-datafusion/src/substrait.rs:227:157

lancedb / lance

bug: substrait does not work with large string #3133

How to Preproduce