lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.97k stars, 224 forks

AttributeError: attribute 'schema' of 'pyarrow.lib.Table' objects is not writable #3108

Closed Jay-ju closed 6 days ago

Jay-ju commented 1 week ago

When I run the following code, the write fails while LanceDatasink tries to strip the schema metadata.

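import ray
from lance.ray.sink import LanceDatasink

# input_dir, output_dir, TOS_FS, FILE_TYPES, required_schema and
# storage_options are defined elsewhere in my environment.

# Read the WebDataset shards as pandas blocks.
ds = ray.data.read_webdataset(
    paths=input_dir,
    filesystem=TOS_FS,
    suffixes=FILE_TYPES,
    concurrency=1,
    output_format='pandas',
    decoder=None,
    expand_json=True
)

# Write the blocks out as a Lance dataset.
sink = LanceDatasink(
    output_dir,
    schema=required_schema,
    max_rows_per_file=1000,
    storage_options=storage_options,
    mode="overwrite",
)

ds.write_datasink(sink, concurrency=1)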

The specific stack trace is as follows:

yield from self._block_fn(input, ctx)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/data00/code/ray/python/ray/data/_internal/planner/plan_write_op.py", line 26, in fn
    write_result = datasink_or_legacy_datasource.write(blocks, ctx)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data00/miniconda3/envs/ray_311/lib/python3.11/site-packages/lance/ray/sink.py", line 246, in write
    fragments_and_schema = _write_fragment(
                           ^^^^^^^^^^^^^^^^
File "/data00/miniconda3/envs/ray_311/lib/python3.11/site-packages/lance/ray/sink.py", line 88, in _write_fragment
    fragments = write_fragments(
                ^^^^^^^^^^^^^^^^
File "/data00/miniconda3/envs/ray_311/lib/python3.11/site-packages/lance/fragment.py", line 614, in write_fragments
    fragments = _write_fragments(
                ^^^^^^^^^^^^^^^^^
OSError: LanceError(Arrow): C Data interface error: Unknown error: attribute 'schema' of 'pyarrow.lib.Table' objects is not writable. Detail: Python exception: Traceback (most recent call last):
  File "/data00/miniconda3/envs/ray_311/lib/python3.11/site-packages/lance/ray/sink.py", line 79, in record_batch_converter
    tbl = _pd_to_arrow(block, schema)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data00/miniconda3/envs/ray_311/lib/python3.11/site-packages/lance/ray/sink.py", line 45, in _pd_to_arrow
    tbl.schema = tbl.schema.remove_metadata()
    ^^^^^^^^^^
AttributeError: attribute 'schema' of 'pyarrow.lib.Table' objects is not writable
, /data00/code/lance/rust/lance-datafusion/src/utils.rs:41:28

eddyxu commented 6 days ago

Merged. Thanks @Jay-ju !