lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

AttributeError: 'NoneType' object has no attribute 'get' on `dataset.filter` #2777

Open tonyf opened 2 months ago

tonyf commented 2 months ago

Looks like filter is broken?

Repro:

import lance

ds = lance.dataset(path)
ds.filter("split = 'test'")

>>>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 796, in pyarrow._dataset.Dataset.filter
AttributeError: 'NoneType' object has no attribute 'get'

Looks like in that pyarrow file, self._scan_options is None

current_filter = self._scan_options.get("filter")

Version:

Is this a version issue? I tried downgrading to pyarrow==12.0.0 but I'm still running into the error.

jacketsj commented 2 months ago

Try split == 'test'

tonyf commented 2 months ago

Same issue

jacketsj commented 2 months ago

I'm not sure if it does what you're expecting from .filter, but ds.to_table(filter="split == 'test'") should work if split is a column.
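
A minimal sketch of that pattern (the path and the split column are assumptions here):

import lance

ds = lance.dataset("./my_dataset.lance")  # hypothetical path to a Lance dataset
# the filter is applied during the scan, before the result is materialized
tbl = ds.to_table(filter="split == 'test'")
print(tbl.num_rows)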

tonyf commented 2 months ago

I was originally trying to solve https://github.com/lancedb/lance/issues/2778 when I ran into this issue.

I'm building a custom torch dataloader for a 10TB dataset, so materializing it isn't an option.
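
Concretely, what I need is to stream filtered batches, something like this (path, column name, and batch size are illustrative):

import lance

ds = lance.dataset("./my_dataset.lance")  # hypothetical 10TB dataset
# scanner() pushes the filter down into the scan, and to_batches() yields
# pyarrow RecordBatches incrementally rather than loading the whole dataset
scanner = ds.scanner(filter="split = 'test'", batch_size=1024)
for batch in scanner.to_batches():
    pass  # hand each RecordBatch to the torch dataloader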

jacketsj commented 2 months ago

I get the same error. Putting a simple repro here:

import lance
import pyarrow as pa

table = pa.Table.from_pylist([{"name": "Alice", "age": 20},
                              {"name": "Bob", "age": 30}])
lance.write_dataset(table, "./alice_and_bob.lance", mode="overwrite")
ds = lance.dataset("./alice_and_bob.lance")
ds.filter("age == 30")

As far as I can tell, the above is the intended use according to the API spec.

westonpace commented 2 months ago

Hmm, yes. We extend pyarrow.dataset.Dataset because we want to appear as a pyarrow dataset, since there is no dataset protocol at the moment. E.g., this is how DuckDB is able to query us (it thinks we are a pyarrow dataset).
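
For context, that querying path looks roughly like this from Python (reusing the alice_and_bob dataset from the repro above; DuckDB's replacement scan resolves the local variable ds):

import duckdb
import lance

ds = lance.dataset("./alice_and_bob.lance")
# DuckDB sees ds as a pyarrow dataset and pushes the WHERE clause down
# into the Lance scan
duckdb.sql("SELECT name FROM ds WHERE age = 30").show()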

In this case it looks like pyarrow has this function: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.filter

We are not overriding it, so it falls back to the underlying pyarrow implementation (which isn't meant to be used with Lance). We should override this method and provide some kind of implementation.

The same probably goes for sort_by, join (we should nicely report that it isn't supported), join_asof (same, not supported), and replace_schema (again, not supported).
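
A rough sketch of what those overrides might look like (names and structure are assumptions, not the actual Lance code):

import copy

import pyarrow.dataset

class LanceDataset(pyarrow.dataset.Dataset):
    # simplified stand-in for lance's dataset class

    def filter(self, expression):
        # one possible implementation: remember the filter and apply it on
        # every subsequent scan, instead of delegating to pyarrow's
        # _scan_options machinery (which is None for us)
        new = copy.copy(self)
        new._default_filter = expression
        return new

    def sort_by(self, sorting, **kwargs):
        raise NotImplementedError("sort_by is not supported by Lance datasets")

    def join(self, right_dataset, keys, **kwargs):
        raise NotImplementedError("join is not supported by Lance datasets")

    def join_asof(self, right_dataset, on, by, tolerance, **kwargs):
        raise NotImplementedError("join_asof is not supported by Lance datasets")

    def replace_schema(self, schema):
        raise NotImplementedError("replace_schema is not supported by Lance datasets")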