During the investigation of https://github.com/lancedb/lancedb/issues/1539, it turned out that Lance already supports pyarrow.compute.Expression filters. However, Polars by default maps string columns to pa.large_string, which prevents us from pushing Arrow filters down through Polars.
How to Reproduce
import pyarrow as pa
import pyarrow.compute as pc
import lance

schema = pa.schema({"x": pa.int64(), "str": pa.large_string()})
table = pa.table(
    {"x": range(10), "str": [f"str-{i}" for i in range(10)]}, schema=schema
)
ds = lance.write_dataset(table, tmp_path)  # tmp_path: the pytest tmp_path fixture
print(pc.field("str").isin(["str-2", "str-3"]))
batch = list(ds.to_batches(filter=pc.field("str").isin(["str-2", "str-3"])))[0]
print(batch)
With error:
def to_reader(self) -> pa.RecordBatchReader:
> return self._scanner.to_pyarrow()
E ValueError: Invalid user input: pushdown filter referenced a field that is not yet supported by Substrait conversion, /Users/lei/work/lance/rust/lance-datafusion/src/substrait.rs:227:157