Handle repeated fields with no annotation in record reader

sadikovi commented 6 years ago

This PR fixes an issue where we cannot read files that contain explicitly defined arrays/lists that are not marked with the LIST logical type. For example:

message user {
  REQUIRED INT32 id;
  OPTIONAL group phoneNumbers {
    REPEATED group phone {
      REQUIRED INT64 number;
      OPTIONAL BYTE_ARRAY kind (UTF8);
    }
  }
}

The current master branch panics when reading this file because we convert every group type into a struct. Instead, the repeated phone field should be converted into a list of phone elements (number, kind).
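As a rough illustration of the rule described above, here is a minimal sketch that classifies each field of the `user` schema. The types and functions (`Field`, `ReaderKind`, `reader_kind`, `user_schema`) are hypothetical and written only for this comment; they are not the parquet-rs schema API or the code in this PR. The sketch assumes the rule is: a repeated field that is not annotated with LIST/MAP and whose parent is not annotated with LIST/MAP is read as a list of its own element type.

```rust
// Hypothetical, simplified schema model. NOT the parquet-rs API.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Annotation {
    None,
    List,
    Map,
}

#[allow(dead_code)]
#[derive(Debug)]
struct Field {
    name: &'static str,
    repetition: Repetition,
    annotation: Annotation,
    children: Vec<Field>, // empty for primitive fields
}

#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum ReaderKind {
    Primitive,
    Struct,
    AnnotatedCollection, // group annotated with LIST or MAP
    RepeatedList,        // bare repeated field, read as a list of its element
}

// Decide how a field should be read, given its parent (None for the root).
fn reader_kind(field: &Field, parent: Option<&Field>) -> ReaderKind {
    let parent_is_collection = parent
        .map(|p| p.annotation != Annotation::None)
        .unwrap_or(false);
    // The special case: repeated, no annotation, and not an inner layer
    // of a LIST/MAP-annotated group.
    if field.repetition == Repetition::Repeated
        && field.annotation == Annotation::None
        && !parent_is_collection
    {
        return ReaderKind::RepeatedList;
    }
    match field.annotation {
        Annotation::List | Annotation::Map => ReaderKind::AnnotatedCollection,
        Annotation::None if field.children.is_empty() => ReaderKind::Primitive,
        Annotation::None => ReaderKind::Struct,
    }
}

fn primitive(name: &'static str, repetition: Repetition) -> Field {
    Field { name, repetition, annotation: Annotation::None, children: vec![] }
}

fn group(name: &'static str, repetition: Repetition, children: Vec<Field>) -> Field {
    Field { name, repetition, annotation: Annotation::None, children }
}

// The `user` message from the example above (UTF8 on `kind` is omitted,
// since it does not affect list handling).
fn user_schema() -> Field {
    group("user", Repetition::Required, vec![
        primitive("id", Repetition::Required),
        group("phoneNumbers", Repetition::Optional, vec![
            group("phone", Repetition::Repeated, vec![
                primitive("number", Repetition::Required),
                primitive("kind", Repetition::Optional),
            ]),
        ]),
    ])
}
```

With this model, `phoneNumbers` is still classified as a struct, while `phone` is classified as a list whose elements are (number, kind) structs.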

Spark returns the following result when reading the file:

+---+-------------------------------------------------------------------------+
|id |phoneNumbers                                                             |
+---+-------------------------------------------------------------------------+
|1  |null                                                                     |
|2  |null                                                                     |
|3  |[WrappedArray()]                                                         |
|4  |[WrappedArray([5555555555,null])]                                        |
|5  |[WrappedArray([1111111111,home])]                                        |
|6  |[WrappedArray([1111111111,home], [2222222222,null], [3333333333,mobile])]|
+---+-------------------------------------------------------------------------+
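To make the distinction between a null group, an empty list, and a populated list explicit, the six rows in the table above can be modeled as plain Rust values. This is only a sketch: the `User`, `PhoneNumbers`, and `Phone` types and the helper functions are hypothetical names chosen for illustration, not the record types that parquet-rs returns.

```rust
// Hypothetical types mirroring the `user` schema; not the parquet-rs Row API.
#[derive(Debug, Clone, PartialEq)]
struct Phone {
    number: i64,          // REQUIRED INT64 number
    kind: Option<String>, // OPTIONAL BYTE_ARRAY kind (UTF8)
}

#[derive(Debug, Clone, PartialEq)]
struct PhoneNumbers {
    phone: Vec<Phone>, // REPEATED group phone, read as a list
}

#[derive(Debug, Clone, PartialEq)]
struct User {
    id: i32,                             // REQUIRED INT32 id
    phone_numbers: Option<PhoneNumbers>, // OPTIONAL group phoneNumbers
}

fn phone(number: i64, kind: Option<&str>) -> Phone {
    Phone { number, kind: kind.map(String::from) }
}

fn user(id: i32, phones: Option<Vec<Phone>>) -> User {
    User { id, phone_numbers: phones.map(|phone| PhoneNumbers { phone }) }
}

// The six rows shown in the Spark output above.
fn expected_rows() -> Vec<User> {
    vec![
        user(1, None),         // phoneNumbers is null
        user(2, None),         // phoneNumbers is null
        user(3, Some(vec![])), // group present, list empty
        user(4, Some(vec![phone(5555555555, None)])),
        user(5, Some(vec![phone(1111111111, Some("home"))])),
        user(6, Some(vec![
            phone(1111111111, Some("home")),
            phone(2222222222, None),
            phone(3333333333, Some("mobile")),
        ])),
    ]
}

// Number of phone entries, or None when the phoneNumbers group is null.
fn phone_count(user: &User) -> Option<usize> {
    user.phone_numbers.as_ref().map(|p| p.phone.len())
}
```

Rows 1 and 2 have no `phoneNumbers` group at all, whereas row 3 has the group with zero `phone` entries, which is why Spark prints `null` for the former and `[WrappedArray()]` for the latter.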

It seems that we forgot to handle a special case that is mentioned here https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L73.

I added code to handle this case, along with a new test file that uses this schema and a test case for it.

sadikovi commented 6 years ago

@sunchao Could you review this PR? Thanks!

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 594

0 of 0 changed or added relevant lines in 0 files are covered.
89 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.03%) to 95.485%

Files with Coverage Reduction    New Missed Lines    %
record/reader.rs                 89                  87.17%
Total:                           89

Totals
Change from base Build 591:	-0.03%
Covered Lines:	12013
Relevant Lines:	12581

sunchao / parquet-rs

Handle repeated fields with no annotation in record reader #151
