jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

`parquet_read` panics when working with `date64`s #1400

Closed. kjschiroo closed this issue 1 year ago.

kjschiroo commented 1 year ago

The parquet_read example panics when reading the file generated by the following Python snippet:

import datetime

import pyarrow as pa
import pyarrow.parquet

print(f"pyarrow {pa.__version__}")

table = pa.Table.from_pydict(
    {
        "my_column": pa.array(
            [datetime.date(2022, 6, 28)],
            pa.date64()
        )
    }
)
# Write a single-row date64 column using data page v2 and snappy compression.
with open("sample.parquet", "wb") as f:
    pa.parquet.write_table(
        table=table,
        where=f,
        version="2.6",
        data_page_version="2.0",
        compression="SNAPPY",
    )

Generating the file:

> python3 minimal.py
pyarrow 11.0.0
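
For what it's worth, pyarrow itself reads the file back without complaint. A quick sanity-check sketch (the expected schema in the comment assumes pyarrow restores its stored Arrow schema on read):

import pyarrow.parquet as pq

# Read the file back with pyarrow to confirm it is well-formed.
table = pq.read_table("sample.parquet")
print(table.schema)       # expected: my_column: date64[ms]
print(table.to_pydict())  # {'my_column': [datetime.date(2022, 6, 28)]}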

Running parquet_read, the first issue I run into appears to originate from reading the statistics:

> RUST_BACKTRACE=1 cargo run --release --features io_parquet,io_parquet_compression --example parquet_read sample.parquet 
    Finished release [optimized] target(s) in 0.16s
     Running `target/release/examples/parquet_read sample.parquet`
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/io/parquet/read/statistics/primitive.rs:50:29
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:111:5
   3: arrow2::io::parquet::read::statistics::primitive::push
   4: arrow2::io::parquet::read::statistics::push
   5: arrow2::io::parquet::read::statistics::deserialize
   6: parquet_read::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

If I comment out the statistics read (lines 24-27 of parquet_read.rs) I get:

> RUST_BACKTRACE=1 cargo run --release --features io_parquet,io_parquet_compression --example parquet_read sample.parquet
   Compiling arrow2 v0.16.0 (/home/kjschiroo/Desktop/arrow2)
    Finished release [optimized] target(s) in 41.45s
     Running `target/release/examples/parquet_read sample.parquet`
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/io/parquet/read/deserialize/primitive/basic.rs:229:40
stack backtrace:
   0: rust_begin_unwind
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_bounds_check
             at /rustc/53e4b9dd74c29cc9308b8d0f10facac70bb101a7/library/core/src/panicking.rs:147:5
   3: arrow2::io::parquet::read::deserialize::utils::extend_from_decoder
   4: <arrow2::io::parquet::read::deserialize::primitive::basic::PrimitiveDecoder<T,P,F> as arrow2::io::parquet::read::deserialize::utils::Decoder>::extend_from_state
   5: arrow2::io::parquet::read::deserialize::utils::extend_from_new_page
   6: arrow2::io::parquet::read::deserialize::utils::next
   7: <arrow2::io::parquet::read::deserialize::primitive::integer::IntegerIter<T,I,P,F> as core::iter::traits::iterator::Iterator>::next
   8: <arrow2::io::parquet::read::deserialize::primitive::integer::IntegerIter<T,I,P,F> as core::iter::traits::iterator::Iterator>::next
   9: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  10: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  11: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
  12: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
  13: core::iter::adapters::try_process
  14: <arrow2::io::parquet::read::row_group::RowGroupDeserializer as core::iter::traits::iterator::Iterator>::next
  15: <arrow2::io::parquet::read::file::FileReader<R> as core::iter::traits::iterator::Iterator>::next
  16: <arrow2::io::parquet::read::file::FileReader<R> as core::iter::traits::iterator::Iterator>::next
  17: parquet_read::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

This is the error I'd originally stumbled upon. Any thoughts on what might be up?

kjschiroo commented 1 year ago

@jorgecarleitao Thanks for responding to this so quickly! I'd noticed your PR said that writing date64 to parquet is implementation-defined, which I hadn't been aware of. Is there any source you could point me towards so I can better understand the amount of interoperability I should expect between parquet files created and consumed by different libraries?

jorgecarleitao commented 1 year ago

In general, interoperability is high. The main exceptions are data types whose representation in one format (e.g. Arrow) is not uniquely represented in another (e.g. Parquet). In those cases, there is a tradeoff that libraries have to make.

In the case of date64, Parquet supports dates as 32-bit integers (days since the epoch). Arrow libraries must decide whether they write date64 as 32-bit Parquet dates or as 64-bit Parquet integers; this choice is implementation-defined.

Since date64 in Arrow is kind of useless anyway (every value must be a multiple of 86400000, the number of milliseconds in a day), sticking to Parquet int32 is likely best. Alternatively, avoiding Arrow date64 altogether results in the highest possible compatibility.
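
To make the representation concrete, here is a minimal Python sketch (same pyarrow API as the repro above) showing that date64 stores milliseconds since the Unix epoch:

import datetime

import pyarrow as pa

# date64 stores milliseconds since the Unix epoch, so every valid value
# is a multiple of 86_400_000 (the number of milliseconds in a day).
arr = pa.array([datetime.date(2022, 6, 28)], pa.date64())

# Cast to int64 to inspect the raw storage.
ms = arr.cast(pa.int64())[0].as_py()
assert ms % 86_400_000 == 0
print(ms)                # 1656374400000
print(ms // 86_400_000)  # 19171 days since 1970-01-01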

The reference for pyarrow is its Parquet documentation, where it says:

(3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.
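
Assuming pyarrow's schema-inspection API, this mapping can be confirmed directly on the file from the report above:

import pyarrow.parquet as pq

# Look at the physical and logical types pyarrow chose for the column.
column = pq.ParquetFile("sample.parquet").schema.column(0)
print(column.physical_type)  # INT32, per the mapping quoted above
print(column.logical_type)   # Date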

I hope this helps :)

kjschiroo commented 1 year ago

Thanks! That's exactly what I was looking for! I didn't realize that date64 was in milliseconds since the epoch. I'd just assumed it must have been days.