lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

LanceFragment.create does not coerce input to the target schema #2549

Closed jiachengdb closed 1 month ago

jiachengdb commented 1 month ago

Here is a repro on pylance 0.13.0.

import pyarrow as pa
import lance
from lance.fragment import LanceFragment, FragmentMetadata
import shutil

uri = '/tmp/examples.lance'
shutil.rmtree(uri, ignore_errors=True)

arrow_schema = pa.schema(
    [
        pa.field('id', pa.string()),
        pa.field('value', pa.int32()),
    ]
)

reader = pa.table({"id": ["1", "2"], "value": [100, 200]})

fragment = LanceFragment.create(uri, data=reader, schema=arrow_schema)

import pandas as pd

data = {'fragment_json': [fragment._metadata.json()]}
df = pd.DataFrame(data)

fragments = [FragmentMetadata.from_json(fragment_json) for fragment_json in df['fragment_json']]

operation = lance.LanceOperation.Overwrite(arrow_schema, fragments=fragments)
dataset = lance.LanceDataset.commit(uri, operation)

print(dataset.to_table().to_pandas().to_string())

This will output

 id  value
0  1    100
1  2      0

If I change the type of the "value" column to int64, the result is correct.
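The corrupted second row is consistent with 64-bit values being read back through a 32-bit lens: on a little-endian machine, the first two 4-byte words of int64 `[100, 200]` are exactly `100` and `0`. A small numpy sketch of that reinterpretation, offered as a plausible explanation rather than a confirmed account of Lance internals:

```python
import numpy as np

# The repro's values, stored as int64 (the type pa.table infers
# for plain Python ints).
values = np.array([100, 200], dtype=np.int64)

# Reinterpret the same bytes as int32. On little-endian hardware
# the first two 4-byte words are [100, 0] -- matching the repro output.
as_int32 = values.view(np.int32)[:2]
print(list(as_int32))
```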

The bug is that LanceFragment.create neither checks that the input data conforms to the target schema (it should fail if the schemas don't match) nor coerces the input data to that schema. The current behavior is dangerous because it silently corrupts the data.

wjones127 commented 1 month ago

This is true, and intended behavior. For this kind of low-level API it's hard to performantly do all the safety checks we do for high level APIs. Hence the warning we have in the docs:

This is an advanced API and doesn’t provide the same level of validation as the other APIs. For example, it’s the responsibility of the caller to ensure that the fragments are valid for the schema.

(From https://lancedb.github.io/lance/api/python/lance.html#lance.dataset.LanceDataset.commit)

jiachengdb commented 1 month ago

Yeah, I think it would be nice to add a sanity check for the schema and reject mismatched writes.

changhiskhan commented 1 month ago

@jiachengdb for now we don't plan to address this at this API level. If this is a big pain point, let's discuss together whether we can add something to the higher-level lancedb API?