kjschiroo closed this issue 1 year ago.
@jorgecarleitao Thanks for responding to this so quickly! I'd noticed your PR said that writing date64 to parquet is implementation-defined, which I hadn't been aware of. Is there a source you could point me towards so I can better understand how much interoperability I should expect between parquet files created and consumed by different libraries?
In general the interoperability is high. The main exceptions are data types whose representation in one format (e.g. arrow) is not uniquely represented in another (e.g. parquet). In those cases, there is a tradeoff that libraries have to make.

In the case of date64, parquet only supports 32-bit dates. Arrow libraries must decide whether to write date64 as 32-bit parquet dates or as 64-bit parquet integers; this choice is implementation-defined.

Since date64 in Arrow is kind of useless anyway (every value must be a multiple of 86400000, i.e. one day in milliseconds), sticking to parquet int32 is likely best. Alternatively, avoiding arrow date64 altogether gives the highest possible compatibility.
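For example, something like this with pyarrow (a sketch; the dates are illustrative) keeps the data as date32 end to end:

```python
import datetime

import pyarrow as pa

# date64 stores milliseconds since the epoch, so every valid value is a
# whole multiple of 86400000; date32 stores days and loses nothing here.
d64 = pa.array([datetime.date(2021, 1, 1)], type=pa.date64())
d32 = d64.cast(pa.date32())  # write this instead for maximum compatibility
```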
The reference for pyarrow is here, where it says:

> (3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.
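One way to check that behaviour (a sketch, assuming pyarrow; the file name is illustrative):

```python
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"d": pa.array([datetime.date(2021, 1, 1)], type=pa.date64())})
pq.write_table(table, "check.parquet")

# The column's physical parquet type should print as INT32 (a parquet
# DATE), not INT64, even though the arrow type was date64.
print(pq.ParquetFile("check.parquet").schema.column(0).physical_type)
```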
I hope this helps :)
Thanks! That's exactly what I was looking for! I didn't realize that date64 was in milliseconds since the epoch. I'd just assumed it must have been days.
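In case it helps anyone else, a quick sketch (assuming pyarrow) of the difference in the underlying values:

```python
import datetime

import pyarrow as pa

# One day after the epoch: date32 counts days, date64 counts milliseconds.
day = [datetime.date(1970, 1, 2)]
print(pa.array(day, type=pa.date32()).cast(pa.int32()))  # [1]
print(pa.array(day, type=pa.date64()).cast(pa.int64()))  # [86400000]
```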
The `parquet_read` example panics when reading the file generated by the following snippet of python.

Generating the file:
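A minimal sketch of such a script (assuming pyarrow; the file name, column name, and dates are illustrative):

```python
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# A table with a single date64 column; pyarrow stores date64 as
# milliseconds since the epoch.
table = pa.table({
    "when": pa.array(
        [datetime.date(2021, 1, 1), datetime.date(2021, 1, 2)],
        type=pa.date64(),
    )
})
pq.write_table(table, "date64.parquet")
```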
Running `parquet_read`, the first issue I run into appears to originate from reading statistics.
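For reference, the statistics in question can be inspected from the python side (a sketch, assuming pyarrow and the illustrative `date64.parquet` above):

```python
import pyarrow.parquet as pq

# Column-chunk statistics for the first row group; pyarrow writes these
# by default, so `stats` should not be None here.
meta = pq.ParquetFile("date64.parquet").metadata
stats = meta.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)
```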
If I comment out the statistics read (ln 24-27) of `parquet_read.rs`, I get the error I'd originally stumbled upon. Any thoughts on what might be up?