lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Apache Arrow does not support FieldRef to list of structs #60

Closed eddyxu closed 1 year ago

eddyxu commented 1 year ago

Problem Statement

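Apache Arrow does not support a field reference (FieldRef) to a field nested inside a list<struct> column. Selecting the dotted path "annotations.label" in a scanner fails: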


import duckdb
import lance

# Selecting a field nested inside a list<struct> column raises ArrowInvalid.
ds = lance.dataset("./coco.lance").scanner(columns=["id", "annotations.label"])

Error:

Traceback (most recent call last):
  File "/Users/lei/work/lance/./query.py", line 6, in <module>
    ds = lance.dataset("./coco.lance").scanner(columns=["id", "annotations.label"])
  File "pyarrow/_dataset.pyx", line 271, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 2328, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2174, in pyarrow._dataset._populate_builder
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(annotations.label) in id: int64
width: int64
height: int64
file_name: string
image: struct<data: binary>
annotations: list<item: struct<area: double, box: struct<xmax: double, xmin: double, ymax: double, ymin: double>, label: string, label_id: int64, segmentation: struct<height: int64, polygon: list<item: list<item: double>>, rle: list<item: int64>, type: int64, width: int64>, supercategory: string>>
__index_level_0__: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

Expected Behavior

Using annotations.label should return values of type list<struct<label: str>>, a subset view of the original annotations list<struct>.

eddyxu commented 1 year ago

Reported in Arrow's JIRA https://issues.apache.org/jira/browse/ARROW-17540

changhiskhan commented 1 year ago

@eddyxu we control this now don't we? i think we could make list-of-struct a lot easier to work with?