lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
Apache License 2.0

Apache Arrow does not support FieldRef to list of structs #60

Closed by eddyxu 1 year ago

eddyxu commented 1 year ago

Problem Statement

Apache Arrow does not support a field reference into a list<struct>, so a nested field such as annotations.label cannot be selected by name:

import lance

ds = lance.dataset("./coco.lance").scanner(columns=["id", "annotations.label"])


Traceback (most recent call last):
  File "/Users/lei/work/lance/./", line 6, in <module>
    ds = lance.dataset("./coco.lance").scanner(columns=["id", "annotations.label"])
  File "pyarrow/_dataset.pyx", line 271, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 2328, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2174, in pyarrow._dataset._populate_builder
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(annotations.label) in id: int64
width: int64
height: int64
file_name: string
image: struct<data: binary>
annotations: list<item: struct<area: double, box: struct<xmax: double, xmin: double, ymax: double, ymin: double>, label: string, label_id: int64, segmentation: struct<height: int64, polygon: list<item: list<item: double>>, rle: list<item: int64>, type: int64, width: int64>, supercategory: string>>
__index_level_0__: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

Expected Behavior

Using annotations.label should return values of type list<struct<label: string>>, a subset view of the original annotations list<struct>.
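
To make the expected result concrete, here is a minimal plain-Python sketch of the projection semantics on row dictionaries. It is an illustration only, not Arrow or Lance code: the helper name project_list_struct and the sample rows are hypothetical, and it assumes a single-level path of the form column.leaf.

```python
def project_list_struct(rows, path):
    """Keep only one leaf field inside each struct of a list-of-struct column.

    `path` is assumed to look like "column.leaf" (one level of nesting).
    """
    column, leaf = path.split(".", 1)
    projected = []
    for row in rows:
        items = row[column]
        if items is None:
            # A null list stays null in the projected view.
            projected.append(None)
        else:
            # Each struct is narrowed to the single requested field.
            projected.append([{leaf: item[leaf]} for item in items])
    return projected


# Hypothetical sample rows shaped like the schema in the traceback above.
rows = [
    {"id": 1, "annotations": [{"label": "cat", "area": 10.0},
                              {"label": "dog", "area": 4.5}]},
    {"id": 2, "annotations": [{"label": "bird", "area": 2.0}]},
]

print(project_list_struct(rows, "annotations.label"))
# [[{'label': 'cat'}, {'label': 'dog'}], [{'label': 'bird'}]]
```

The list lengths and ordering are unchanged; only the struct is narrowed, which is what "subset view" means here.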

eddyxu commented 1 year ago

Reported in Arrow's JIRA

changhiskhan commented 1 year ago

@eddyxu we control this now don't we? i think we could make list-of-struct a lot easier to work with?