lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming..
https://lancedb.github.io/lance/
Apache License 2.0
3.97k stars 224 forks source link

bug: substrait does not work with large string #3133

Open eddyxu opened 5 days ago

eddyxu commented 5 days ago

During the investigation of https://github.com/lancedb/lancedb/issues/1539, it turns out that lance already support pyarrow.compute.Expression, however Polars by default set string column to pa.large_string, which prevents us to push arrow filter via Polars.

How to Preproduce

import pyarrow as pa
import lance

schema = pa.schema({"x": pa.int64(), "str": pa.large_string()})
table = pa.table(
    {"x": range(10), "str": [f"str-{i}" for i in range(10)]}, schema=schema
)

ds = lance.write_dataset(table, tmp_path)

print(pc.field("str").isin(["str-2", "str-3"]))
batch = list(ds.to_batches(filter=pc.field("str").isin(["str-2", "str-3"])))[0]
print(batch)

With error

    def to_reader(self) -> pa.RecordBatchReader:
>       return self._scanner.to_pyarrow()
E       ValueError: Invalid user input: pushdown filter referenced a field that is not yet supported by Substrait conversion, /Users/lei/work/lance/rust/lance-datafusion/src/substrait.rs:227:157