Open tonyf opened 2 months ago
Try split == 'test'
Same issue
I'm not sure if it does what you're expecting from .filter, but ds.to_table(filter="split == 'test'") should work if split is a column.
Originally was trying to solve this https://github.com/lancedb/lance/issues/2778 when I ran into this issue.
Building a custom torch dataloader for a 10TB dataset so materializing it isn't an option.
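For a dataset that cannot be materialized, the same filter string can in principle be applied while streaming batches instead of calling to_table. This is a minimal sketch, assuming the dataset's to_batches method accepts filter and batch_size keyword arguments and yields pyarrow record batches. The helper name iter_filtered_rows is hypothetical and not part of lance.

```python
from typing import Any, Dict, Iterator


def iter_filtered_rows(
    ds: Any, predicate: str, batch_size: int = 1024
) -> Iterator[Dict[str, Any]]:
    """Yield matching rows one at a time, holding only one batch in memory.

    Assumption: `ds` is a lance dataset whose `to_batches` method accepts
    `filter` and `batch_size`, and each batch is a pyarrow RecordBatch.
    """
    for batch in ds.to_batches(filter=predicate, batch_size=batch_size):
        # RecordBatch.to_pylist() converts one batch into a list of row dicts.
        yield from batch.to_pylist()
```

With the alice_and_bob.lance dataset from the repro in this thread, this would be called as iter_filtered_rows(lance.dataset("./alice_and_bob.lance"), "age == 30"). A custom torch dataloader could wrap the same generator in an IterableDataset.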
I get the same error. Putting a simple repro here:
import lance
import pyarrow as pa
table = pa.Table.from_pylist([{"name": "Alice", "age": 20},
{"name": "Bob", "age": 30}])
lance.write_dataset(table, "./alice_and_bob.lance", mode="overwrite")
ds = lance.dataset("./alice_and_bob.lance")
ds.filter("age == 30")
As far as I can tell, the above is the intended use according to the API spec.
Hmm, yes. We extend pyarrow.dataset.Dataset because we want to appear as a pyarrow dataset, since there is no dataset protocol at the moment. E.g. this is how DuckDB is able to query us (it thinks we are a pyarrow dataset).
In this case it looks like pyarrow has this function: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.filter
We are not overriding it, so it is falling back to the underlying pyarrow impl (which isn't meant to be used here). We should override this method and provide some kind of implementation.
The same probably goes for sort_by, join (we should nicely report it isn't supported), join_asof (same, not supported), and replace_schema (again, not supported).
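The pattern described above could look roughly like the following. This is a plain-Python sketch only: BaseDataset and WrappedDataset are hypothetical stand-ins, not the real pyarrow or lance classes, and filter takes a Python callable here rather than a filter string, to keep the example self-contained.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class BaseDataset:
    """Hypothetical stand-in for the parent class; not the real pyarrow API."""

    def filter(self, *args, **kwargs):
        raise RuntimeError("base implementation is not meant to be used")

    # The other inherited entry points fail the same way in this sketch.
    sort_by = join = join_asof = replace_schema = filter


class WrappedDataset(BaseDataset):
    """Hypothetical subclass that overrides every inherited entry point."""

    def __init__(self, rows: List[Row]):
        self._rows = list(rows)

    def to_pylist(self) -> List[Row]:
        return list(self._rows)

    def filter(self, predicate: Callable[[Row], bool]) -> "WrappedDataset":
        # Provide a working implementation instead of falling back to the parent.
        return WrappedDataset([row for row in self._rows if predicate(row)])

    def _unsupported(self, name: str):
        # Report unsupported operations clearly rather than failing obscurely.
        raise NotImplementedError(f"{name} is not supported on WrappedDataset")

    def sort_by(self, *args, **kwargs):
        self._unsupported("sort_by")

    def join(self, *args, **kwargs):
        self._unsupported("join")

    def join_asof(self, *args, **kwargs):
        self._unsupported("join_asof")

    def replace_schema(self, *args, **kwargs):
        self._unsupported("replace_schema")
```

Overriding each method explicitly means a caller sees a clear NotImplementedError naming the operation, instead of an unrelated failure from inside the parent class.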
Looks like filter is broken?
Repro:
Looks like in that pyarrow file, self._scan_options is None.
Version:
Is this a version issue? I tried downgrading to pyarrow=='12.0.0' but am still running into the error.