jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
Incorrect length of MapArray causes read panic #1409

Open kjschiroo opened 1 year ago

Generating a file with a single map column using the Python snippet below causes the parquet_read example to panic on the latest commit of the main branch (653d57ad3):

import pyarrow as pa
import pyarrow.parquet

print(f"pyarrow {pa.__version__}")
# pyarrow 11.0.0

table = pa.Table.from_pydict({"my_column": pa.array([{"foo": 123}, {"foo": 321}], pa.map_(pa.string(), pa.uint64()))})
with open("sample.parquet", "wb") as f:
    pa.parquet.write_table(table=table, where=f, version="2.6", data_page_version="2.0", compression="SNAPPY")

Attempting to read it yields:

$ RUST_BACKTRACE=1 cargo run  --features io_parquet,io_parquet_compression --example parquet_read sample.parquet 
   Compiling arrow2 v0.16.0 (/home/kjschiroo/Desktop/arrow2)
    Finished dev [unoptimized + debuginfo] target(s) in 15.78s
     Running `target/debug/examples/parquet_read sample.parquet`
Statistics {
    null_count: MapArray[[{key: 0, value: 0}]],
    distinct_count: MapArray[[{key: None, value: None}]],
    min_value: MapArray[[{key: foo, value: 123}]],
    max_value: MapArray[[{key: foo, value: 321}]],
}
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/io/parquet/read/row_group.rs:69:37
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:111:5
   3: core::option::Option<T>::unwrap
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/option.rs:778:21
   4: <arrow2::io::parquet::read::row_group::RowGroupDeserializer as core::iter::traits::iterator::Iterator>::next::{{closure}}
             at ./src/io/parquet/read/row_group.rs:69:25
   5: core::iter::adapters::map::map_try_fold::{{closure}}
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/iter/adapters/map.rs:91:28
   6: core::iter::traits::iterator::Iterator::try_fold
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/iter/traits/iterator.rs:2238:21
   7: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/iter/adapters/map.rs:117:9
   8: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/iter/adapters/mod.rs:195:9
   9: core::iter::traits::iterator::Iterator::try_for_each
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/iter/traits/iterator.rs:2299:9
  10: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/iter/adapters/mod.rs:178:9
  11: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/alloc/src/vec/spec_from_iter_nested.rs:26:32
  12: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/alloc/src/vec/spec_from_iter.rs:33:9
  13: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/alloc/src/vec/mod.rs:2748:9
  14: core::iter::traits::iterator::Iterator::collect
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/iter/traits/iterator.rs:1836:9
  15: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter::{{closure}}
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/result.rs:2075:49
  16: core::iter::adapters::try_process
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/iter/adapters/mod.rs:164:17
  17: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/result.rs:2075:9
  18: core::iter::traits::iterator::Iterator::collect
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/iter/traits/iterator.rs:1836:9
  19: <arrow2::io::parquet::read::row_group::RowGroupDeserializer as core::iter::traits::iterator::Iterator>::next
             at ./src/io/parquet/read/row_group.rs:66:21
  20: <arrow2::io::parquet::read::file::FileReader<R> as core::iter::traits::iterator::Iterator>::next
             at ./src/io/parquet/read/file.rs:77:19
  21: parquet_read::main
             at ./examples/parquet_read.rs:42:24
  22: core::ops::function::FnOnce::call_once
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/ops/function.rs:507:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

As far as I have been able to debug, the length of the map array as determined here (https://github.com/jorgecarleitao/arrow2/blob/main/src/array/map/mod.rs#L157) comes back as 1, whereas both my expectations and a debug print of the map array indicate it should be 2. Beyond that, we get into the Offsets object, which I am not yet sure how to conceptualize.
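
For anyone else trying to picture the offsets: in the Arrow columnar format a map column is laid out like a list of key/value structs, so an array with n rows carries n + 1 offsets, and row i owns the flattened entries in the range offsets[i]..offsets[i + 1]. The sketch below is a simplified, standalone model of that idea and not arrow2's actual code. The helper names `row_count` and `row_entries` are hypothetical, and the two offsets buffers are assumed for illustration rather than taken from the failing file.

```rust
// Hypothetical, simplified model of an Arrow-style offsets buffer.
// These helpers are illustrative only and are not part of arrow2's API.

/// Number of rows described by an offsets buffer: one less than the number of offsets.
fn row_count(offsets: &[i32]) -> usize {
    offsets.len().saturating_sub(1)
}

/// Entries belonging to row `i`: the half-open range offsets[i]..offsets[i + 1].
fn row_entries<'a>(
    offsets: &[i32],
    entries: &'a [(&'static str, u64)],
    i: usize,
) -> &'a [(&'static str, u64)] {
    let start = offsets[i] as usize;
    let end = offsets[i + 1] as usize;
    &entries[start..end]
}

fn main() {
    // Flattened key/value entries for the two rows written by the Python snippet.
    let entries = [("foo", 123u64), ("foo", 321u64)];

    // Assumed layout for two rows with one entry each.
    let two_rows: [i32; 3] = [0, 1, 2];
    println!("rows: {}", row_count(&two_rows));
    for i in 0..row_count(&two_rows) {
        println!("row {}: {:?}", i, row_entries(&two_rows, &entries, i));
    }

    // Same entries, but offsets that describe a single row holding both of them.
    let one_row: [i32; 2] = [0, 2];
    println!("rows: {}", row_count(&one_row));
    println!("row 0: {:?}", row_entries(&one_row, &entries, 0));
}
```

Under this model, the same two key/value entries can be reported as an array of length 2 or of length 1 depending only on the offsets, which is why the offsets seem like the place to look next.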