sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Handle repeated fields with no annotation in record reader #151

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR fixes the issue of not being able to read files that contain explicitly defined arrays/lists that are not marked with the LIST logical type. For example:

message user {
  REQUIRED INT32 id;
  OPTIONAL group phoneNumbers {
    REPEATED group phone {
      REQUIRED INT64 number;
      OPTIONAL BYTE_ARRAY kind (UTF8);
    }
  }
}

The current master branch code panics when reading this file because we convert every group type to a struct, whereas the repeated phone field should be converted into a list of phone elements (number, kind).
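The intended conversion can be illustrated with a minimal sketch. It uses a hypothetical, simplified schema model (`Node`, `Repetition`, `Shape`, `shape_of`, and `phone_numbers_schema` are made up for illustration and are not the parquet-rs types or the code in this PR), and it assumes groups without a LIST or MAP annotation only: any field with REPEATED repetition is wrapped in a list whose element is the field itself.

```rust
// Hypothetical, simplified schema model; NOT the parquet-rs API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

#[derive(Debug)]
enum Node {
    Primitive { name: &'static str, repetition: Repetition },
    // Assumption: the group carries no LIST/MAP annotation.
    Group { name: &'static str, repetition: Repetition, fields: Vec<Node> },
}

// Shape of the value a record reader would produce for a field.
#[derive(Debug, PartialEq)]
enum Shape {
    Value(String),
    Struct(String, Vec<Shape>),
    List(String, Box<Shape>),
}

// Groups map to structs and primitives to plain values; a REPEATED field
// of either kind is additionally wrapped in a list of that element.
fn shape_of(node: &Node) -> Shape {
    let (name, repetition, inner) = match node {
        Node::Primitive { name, repetition } => {
            (*name, *repetition, Shape::Value(name.to_string()))
        }
        Node::Group { name, repetition, fields } => (
            *name,
            *repetition,
            Shape::Struct(name.to_string(), fields.iter().map(shape_of).collect()),
        ),
    };
    if repetition == Repetition::Repeated {
        Shape::List(name.to_string(), Box::new(inner))
    } else {
        inner
    }
}

// The `phoneNumbers` group from the schema above.
fn phone_numbers_schema() -> Node {
    Node::Group {
        name: "phoneNumbers",
        repetition: Repetition::Optional,
        fields: vec![Node::Group {
            name: "phone",
            repetition: Repetition::Repeated,
            fields: vec![
                Node::Primitive { name: "number", repetition: Repetition::Required },
                Node::Primitive { name: "kind", repetition: Repetition::Optional },
            ],
        }],
    }
}
```

Under these assumptions, `shape_of(&phone_numbers_schema())` yields a `phoneNumbers` struct holding a list of `phone` structs with fields `number` and `kind`, rather than a struct nested directly in a struct.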

Spark returns the following result when reading the file:

+---+-------------------------------------------------------------------------+
|id |phoneNumbers                                                             |
+---+-------------------------------------------------------------------------+
|1  |null                                                                     |
|2  |null                                                                     |
|3  |[WrappedArray()]                                                         |
|4  |[WrappedArray([5555555555,null])]                                        |
|5  |[WrappedArray([1111111111,home])]                                        |
|6  |[WrappedArray([1111111111,home], [2222222222,null], [3333333333,mobile])]|
+---+-------------------------------------------------------------------------+

It seems that we forgot to handle a special case that is mentioned here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L73

I added code to handle this case, along with a new file with such a schema and a test case for it.

sadikovi commented 6 years ago

@sunchao Could you review this PR? Thanks!

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 594


Files with Coverage Reduction | New Missed Lines | %
record/reader.rs | 89 | 87.17%

Totals
Change from base Build 591: -0.03%
Covered Lines: 12013
Relevant Lines: 12581

💛 - Coveralls