sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Add to_thrift conversion for schema type, fix inconsistencies for unset properties #130

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR adds to_thrift conversion, so we can serialise schema into parquet::SchemaElement. I added tests to verify that applying to_thrift -> from_thrift returns exactly the same schema.

Several things have been found and fixed in the process:

Return type in from_thrift

I changed parameter type for from_thrift from &mut [SchemaElement] to &[SchemaElement], since there is no need to modify the slice.

Inconsistencies when setting undefined precision, scale, length

Inconsistencies with respect to precision, scale and length having 0 or -1 to indicate unset/undefined value. In some parts of the code we were passing 0, in others we were setting -1. I propose that we treat -1 for precision, scale, and length as undefined value, and 0 as a valid value (of course, it depends, for example precision must always be positive). This means that when we use builder and set physical type as FIXED_LEN_BYTE_ARRAY we have to provide length; when we use builder and set logical type as DECIMAL we have to provide precision and scale, even if it is 0. Note that parser still handles the case when decimal definition does not have scale value, because it is considered to be optional for schema string.

sadikovi commented 6 years ago

I also found that there is a bit of inconsistency between parquet-cpp and parquet-mr regarding FIXED_LEN_BYTE_ARRAY having length as 0. Is it correct to have length 0? Currently we allow this, and I personally think it is okay, it is just indicates that array is empty, but would love to hear other opinions!

sadikovi commented 6 years ago

@sunchao Could you review this PR when you have time? Thanks!

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 538


Files with Coverage Reduction New Missed Lines %
encodings/encoding.rs 1 94.64%
schema/types.rs 6 96.68%
schema/parser.rs 10 96.04%
<!-- Total: 17 -->
Totals Coverage Status
Change from base Build 533: 0.03%
Covered Lines: 10544
Relevant Lines: 11067

💛 - Coveralls
sunchao commented 6 years ago

Is it correct to have length 0?

Yes I think it is correct too. Thanks for pointing out!

sunchao commented 6 years ago

Merged. Thanks @sadikovi .