Closed jhorstmann closed 9 months ago
Sample file: eventlog.zip, generated with the Java implementation from parquet-mr.
The `parquet-read` tool from arrow-rs reads this without problems:
```
{case_id: "12345678", events: [[{event_name: "A", event_time: 1970-01-01 00:00:00 +00:00}, {event_name: "B", event_time: 1970-01-01 00:00:00 +00:00}, {event_name: "C", event_time: 1970-01-01 00:00:00 +00:00}]]}
```
`parquet_read` from arrow2 with an added `dbg!(&chunk);` gives this output:
```
Statistics {
    null_count: UInt64[0],
    distinct_count: UInt64[None],
    min_value: Utf8Array[12345678],
    max_value: Utf8Array[12345678],
}
Statistics {
    null_count: ListArray[[{event_name: 0, event_time: 0}]],
    distinct_count: ListArray[[{event_name: None, event_time: None}]],
    min_value: ListArray[[{event_name: A, event_time: 1970-01-01 00:00:00.001 +00:00}]],
    max_value: ListArray[[{event_name: C, event_time: 1970-01-01 00:00:00.003 +00:00}]],
}
[examples/parquet_read.rs:45] &chunk = Chunk {
    arrays: [
        Utf8Array[12345678],
        ListArray[[None, None, None]],
    ],
}
```
Fixed by #1565
I'm trying to read a parquet file that contains a struct inside a list using pola-rs and am getting null values for each element. I think I have tracked the issue down to the schema conversion from parquet to arrow.
The `parquet_to_arrow_schema` function tries to set the `nullable` flag of `Field` according to the parquet repetition levels. That flag is then used via the `InitNested` enum to calculate the level at which data is valid.

My message schema looks like the following:
And I would expect all fields to have the `is_nullable` flag set to `false`. Instead, the `array` field is marked as nullable.

I think the issue can also be shown with the example schemas from parquet-format/LogicalTypes.md, which are tested in `test_parquet_lists`. The comments there do not match the assertions. For example:

According to the comment and documentation, `element` should not be nullable in both examples.

I do not yet have a standalone test case and example file, but will try to provide one later.
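For reference, the element nullability rules come from the canonical three-level LIST layout in parquet-format's LogicalTypes.md: the nullability of the values is determined by the repetition of the innermost `element` field, so a `required` element means non-nullable values:

```
// List<String> (list non-null, elements non-null)
required group my_list (LIST) {
  repeated group list {
    required binary element (STRING);
  }
}
```

With this layout, an Arrow conversion should mark `element` as `is_nullable = false`; only an `optional binary element` would justify nullable values.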
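The level calculation mentioned above can be sketched as follows. This is my own illustration of how Parquet derives maximum definition and repetition levels from the repetition types along a field's path (the `Repetition` enum and `max_levels` function here are hypothetical, not arrow2's actual code):

```rust
// Hypothetical illustration: Parquet derives max definition/repetition
// levels from the repetition of each field along the path (root excluded).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Repetition {
    Required, // contributes nothing
    Optional, // +1 definition level
    Repeated, // +1 definition level, +1 repetition level
}

fn max_levels(path: &[Repetition]) -> (i16, i16) {
    let (mut def, mut rep) = (0i16, 0i16);
    for r in path {
        match r {
            Repetition::Required => {}
            Repetition::Optional => def += 1,
            Repetition::Repeated => {
                def += 1;
                rep += 1;
            }
        }
    }
    (def, rep)
}

fn main() {
    use Repetition::*;
    // required group (LIST) { repeated group list { required element } }
    assert_eq!(max_levels(&[Required, Repeated, Required]), (1, 1));
    // optional group (LIST) { repeated group list { optional element } }
    assert_eq!(max_levels(&[Optional, Repeated, Optional]), (3, 1));
    println!("ok");
}
```

If the converted Arrow schema marks a `required` field as nullable, the reader will look for validity at a definition level that the file never encodes, which is consistent with the all-null `ListArray` seen above.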