keunhong opened this issue 2 months ago
What data type is `column_name`? You say you are using the V1 file format, which doesn't support nulls for most data types, but you are filtering on `IS NOT NULL`.
I've attempted a repro with a string column (where we do support nulls), but I seem to be getting correct answers:
```python
import lance
import pyarrow as pa

data = pa.table({
    'id': ['a', 'b', None, 'c', 'd', None, 'e']
})
ds = lance.write_dataset(data, 'test', mode='overwrite')

scan_batch_size = ds.scanner(
    columns=['id'],
    filter="id IS NOT NULL",
    with_row_id=False,
    batch_size=3,
)
print(scan_batch_size.explain_plan())
scan_batch_size.to_table()
```
```
ProjectionExec: expr=[id@0 as id]
  FilterExec: id@0 IS NOT NULL
    LanceScan: uri=Users/willjones/Documents/notebooks/test/data, projection=[id], row_id=true, row_addr=false, ordered=true
```
```
pyarrow.Table
id: string
----
id: [["a","b"],["c","d"],["e"]]
```
```python
scan_no_batch_size = ds.scanner(
    columns=['id'],
    filter="id IS NOT NULL",
    with_row_id=False,
    batch_size=None,
)
print(scan_no_batch_size.explain_plan())
scan_no_batch_size.to_table()
```
```
LancePushdownScan: uri=Users/willjones/Documents/notebooks/test/data, projection=[id], predicate=id IS NOT NULL, row_id=false, row_addr=false, ordered=true
```
```
pyarrow.Table
id: string
----
id: [["a","b","c","d","e"]]
```
```python
scan_no_row_id = ds.scanner(
    columns=['id'],
    filter="id IS NOT NULL",
    with_row_id=False,
    batch_size=None,
)
print(scan_no_row_id.explain_plan())
scan_no_row_id.to_table()
```
```
LancePushdownScan: uri=Users/willjones/Documents/notebooks/test/data, projection=[id], predicate=id IS NOT NULL, row_id=false, row_addr=false, ordered=true
```
```
pyarrow.Table
id: string
----
id: [["a","b","c","d","e"]]
```
It is a byte column, so I suppose the fact that it works is just a coincidence then.
Hello!

We are experiencing differing behavior of `to_batches` depending on whether we set the `batch_size` or not. When using a filter and `with_row_id=True`, we get the correct behavior if we set the batch size. But if we don't set the batch size, then it ignores the filter and just returns all the rows.

We are using `pylance==0.16.0`. The dataset is using the V1 file format.