jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

Incorrect nullability inferred for nested parquet schema #1556

Closed: jhorstmann closed this issue 9 months ago

jhorstmann commented 10 months ago

I'm trying to read a parquet file that contains a struct inside a list using pola-rs, and I'm getting null values for each element. I think I have tracked the issue down to the schema conversion from parquet to arrow.

The parquet_to_arrow_schema function tries to set the nullable flag of each Field according to the parquet repetition levels. That flag is then used, via the InitNested enum, to calculate the level at which data is valid.
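To illustrate the mapping I would expect (this is my reading of the parquet model, not arrow2's actual code): only OPTIONAL fields should become nullable, while REQUIRED and REPEATED fields should not, since a repeated group encodes list structure rather than optionality. A minimal sketch, using a stand-in Repetition enum rather than arrow2's or parquet2's own types:

// Stand-in for parquet's repetition levels; not a type from arrow2 or parquet2.
enum Repetition {
    Required, // value is always present; never null
    Optional, // value may be missing; maps to a nullable Arrow field
    Repeated, // encodes list structure; the repeated group itself is not nullable
}

// Sketch of the nullability a parquet-to-arrow schema conversion should infer.
fn expected_is_nullable(repetition: Repetition) -> bool {
    match repetition {
        Repetition::Optional => true,
        Repetition::Required | Repetition::Repeated => false,
    }
}

fn main() {
    assert!(!expected_is_nullable(Repetition::Required));
    assert!(expected_is_nullable(Repetition::Optional));
    // The REPEATED case is the one that appears to be mapped incorrectly here.
    assert!(!expected_is_nullable(Repetition::Repeated));
}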

My message schema looks like the following:

message eventlog {
  REQUIRED group events (LIST) {
    REPEATED group array {
      REQUIRED BYTE_ARRAY event_name (STRING);
      REQUIRED INT64 event_time (TIMESTAMP(MILLIS,true));
    }
  }
}
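For reference, this is roughly the Arrow schema I would expect the conversion to produce (a sketch only; the name of the inner list item field and the exact timezone string depend on the conversion):

use arrow2::datatypes::{DataType, Field, TimeUnit};

fn main() {
    // Both struct members are REQUIRED in parquet, so they should be non-nullable.
    let event_struct = DataType::Struct(vec![
        Field::new("event_name", DataType::Utf8, false),
        Field::new(
            "event_time",
            DataType::Timestamp(TimeUnit::Millisecond, Some("+00:00".to_string())),
            false,
        ),
    ]);

    // The REPEATED group "array" only encodes list structure, so neither the
    // item field nor the REQUIRED outer "events" list should be nullable.
    let events = Field::new(
        "events",
        DataType::List(Box::new(Field::new("array", event_struct, false))),
        false,
    );
    println!("{events:#?}");
}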

As sketched above, I would expect all of these fields to have the is_nullable flag set to false. Instead, the array field is marked as nullable. I think the issue can also be seen with the example schemas from parquet-format/LogicalTypes.md that are tested in test_parquet_lists: the comments there do not match the assertions. For example:

        // // List<String> (list nullable, elements non-null)
        // optional group my_list (LIST) {
        //   repeated group element {
        //     required binary str (UTF8);
        //   };
        // }
        {
            arrow_fields.push(Field::new(
                "my_list",
                DataType::List(Box::new(Field::new("element", DataType::Utf8, true))),
                true,
            ));
        }

        // // List<Integer> (nullable list, non-null elements)
        // optional group my_list (LIST) {
        //   repeated int32 element;
        // }
        {
            arrow_fields.push(Field::new(
                "my_list",
                DataType::List(Box::new(Field::new("element", DataType::Int32, true))),
                true,
            ));
        }

According to the comments and the documentation, element should not be nullable in either example.
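A corrected version of those assertions would presumably flip the inner nullability, roughly like this (a sketch, not the current test code):

        // List<String> (list nullable, elements non-null)
        arrow_fields.push(Field::new(
            "my_list",
            DataType::List(Box::new(Field::new("element", DataType::Utf8, false))),
            true,
        ));

        // List<Integer> (nullable list, non-null elements)
        arrow_fields.push(Field::new(
            "my_list",
            DataType::List(Box::new(Field::new("element", DataType::Int32, false))),
            true,
        ));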

I do not yet have a standalone test case and example file, but will try to provide one later.
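In the meantime, the inferred nullability can be inspected directly from the file metadata. A minimal sketch, assuming a recent arrow2 with the io_parquet feature and a local copy of the sample file (the path is a placeholder):

use std::fs::File;

use arrow2::error::Error;
use arrow2::io::parquet::read;

fn main() -> Result<(), Error> {
    // Placeholder path to the attached sample file; adjust as needed.
    let mut reader = File::open("eventlog.parquet")?;

    let metadata = read::read_metadata(&mut reader)?;
    let schema = read::infer_schema(&metadata)?;

    // Print the inferred fields; with this issue, the list item of "events"
    // shows is_nullable: true even though the parquet group is REPEATED
    // with REQUIRED children.
    for field in &schema.fields {
        println!("{field:#?}");
    }
    Ok(())
}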

jhorstmann commented 10 months ago

Sample file: eventlog.zip, generated with the Java implementation from parquet-mr.

The parquet-read tool from arrow-rs reads this without problems:

{case_id: "12345678", events: [[{event_name: "A", event_time: 1970-01-01 00:00:00 +00:00}, {event_name: "B", event_time: 1970-01-01 00:00:00 +00:00}, {event_name: "C", event_time: 1970-01-01 00:00:00 +00:00}]]}

The parquet_read example from arrow2, with an added dbg!(&chunk);, gives this output:

Statistics {
    null_count: UInt64[0],
    distinct_count: UInt64[None],
    min_value: Utf8Array[12345678],
    max_value: Utf8Array[12345678],
}
Statistics {
    null_count: ListArray[[{event_name: 0, event_time: 0}]],
    distinct_count: ListArray[[{event_name: None, event_time: None}]],
    min_value: ListArray[[{event_name: A, event_time: 1970-01-01 00:00:00.001 +00:00}]],
    max_value: ListArray[[{event_name: C, event_time: 1970-01-01 00:00:00.003 +00:00}]],
}
[examples/parquet_read.rs:45] &chunk = Chunk {
    arrays: [
        Utf8Array[12345678],
        ListArray[[None, None, None]],
    ],
}
jhorstmann commented 9 months ago

Fixed by #1565