lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, a vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Better handle schema mismatch when writing dataset #1151

Open chebbyChefNEQ opened 1 year ago

chebbyChefNEQ commented 1 year ago
import pyarrow as pa
import lance

t = pa.Table.from_pylist([{"test": [1, 2, 3]}])
ds = lance.write_dataset(
    t,
    "test.lance",
    schema=pa.schema([pa.field("test", pa.list_(pa.float32(), 32))]),
    mode="overwrite",
)
print(f"===> schema with pa.Table: \n{t.schema}, lance: \n{ds.schema}")

batch = pa.RecordBatch.from_pylist([{"test": [1, 2, 3]}])
ds = lance.write_dataset(
    batch,
    "test.lance",
    schema=pa.schema([pa.field("test", pa.list_(pa.float32(), 32))]),
    mode="overwrite",
)
print(f"===> schema with pa.RecordBatch: \n{batch.schema}, lance: \n{ds.schema}")

lance.write_dataset(
    [pa.RecordBatch.from_pylist([{"test": [1, 2, 3]}])],
    "test.lance",
    schema=pa.schema([pa.field("test", pa.list_(pa.float32(), 32))]),
    mode="overwrite",
)

The above script yields this output:

===> schema with pa.Table:
test: list<item: int64>
  child 0, item: int64, lance:
test: list<item: int64>
  child 0, item: int64
===> schema with pa.RecordBatch:
test: list<item: int64>
  child 0, item: int64, lance:
test: list<item: int64>
  child 0, item: int64
munmap_chunk(): invalid pointer
[1]    472488 IOT instruction (core dumped)  python test.py

In the first two cases, we should perhaps print a warning when schema= is set and the data source is a pa.Table | pa.RecordBatch, since the specified schema is silently ignored.
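For illustration, a minimal sketch of what such a guard could look like (a hypothetical helper, not Lance's actual internals):

import warnings
import pyarrow as pa

def _warn_on_ignored_schema(data, schema):
    # Hypothetical guard: pa.Table / pa.RecordBatch already carry a schema,
    # so an explicit schema= that differs is currently ignored silently.
    if isinstance(data, (pa.Table, pa.RecordBatch)) and schema is not None:
        if data.schema != schema:
            warnings.warn(
                "schema= is ignored when the data source is a pa.Table or "
                "pa.RecordBatch; cast the data to the desired schema first"
            )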

In the third case, we crash on abort because the list of record batches does not match the schema. (Ideally the user would call with a RecordBatchReader (RBR), but sometimes just passing a list of RecordBatches is convenient.)
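As a workaround in the meantime, one can build the batch to match the target schema up front and wrap it in a RecordBatchReader. A sketch, using list size 3 so the example is self-consistent with the three values in the repro:

import pyarrow as pa
import lance

schema = pa.schema([pa.field("test", pa.list_(pa.float32(), 3))])
# Construct the fixed-size-list column directly so no cast is needed.
values = pa.array([1.0, 2.0, 3.0], type=pa.float32())
col = pa.FixedSizeListArray.from_arrays(values, 3)
batch = pa.RecordBatch.from_arrays([col], schema=schema)
reader = pa.RecordBatchReader.from_batches(schema, [batch])
lance.write_dataset(reader, "test.lance", mode="overwrite")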

wjones127 commented 1 year ago
pa.RecordBatch.from_pylist([{"test": [1, 2, 3]}])

produces a variable-size list of int64, so it would need to be cast to a fixed-size list of float32. We do the casting for iterables, but there isn't a kernel for casting a variable-size list to a fixed-size list in PyArrow, which is why the last case errors. And a current bug in arrow-rs means an exception raised inside a RecordBatchReader surfaces as a segfault.
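Since that cast kernel is missing, one manual route is to flatten the variable-size list and rebuild it as fixed-size. A sketch (it assumes every list has exactly `size` elements; the float32 element type comes from the repro's target schema):

import pyarrow as pa

def to_fixed_size_list(arr, size):
    # Flatten to the child values, cast the element type, and
    # reassemble as a fixed-size list of the given size.
    values = arr.flatten().cast(pa.float32())
    return pa.FixedSizeListArray.from_arrays(values, size)

arr = pa.array([[1, 2, 3], [4, 5, 6]])
print(to_fixed_size_list(arr, 3).type)  # fixed_size_list<item: float>[3]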