sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Level encoder should support writing levels for data page v2 #141

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR attempts to update level encoder to support data pages v2 levels, i.e. starting with offset 0 instead of mem::size_of<i32>().

When we write data page v1 levels, we encode the length of the levels in the first 4 bytes of the vector, hence mem::size_of<i32>() offset. With data page v2, we explicitly store the length of the levels, so we do not really need to those 4 bytes offset, and we would rather start with 0.

Level decoder already supports reading data page v1 and v2 by having two methods: set_data, which reads the first 4 bytes and sets the length, and set_data_range that allows to set start and length manually, since we know the start position and data page v2 header has the levels length.

I updated the relevant code and unit tests to reflect that.

sadikovi commented 6 years ago

@sunchao could you review this PR? I found this problem when writing tests for column writer.

Fix works, but IMHO is a bit messy, and I think there is disconnection between level decoder and level encoder when it comes to encoding and decoding for data page v1 and data page v2. Let me know if it is okay, or we should refactor the code somehow. Thanks!

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 579


Files with Coverage Reduction New Missed Lines %
encodings/levels.rs 7 97.55%
column/reader.rs 12 94.84%
<!-- Total: 19 -->
Totals Coverage Status
Change from base Build 571: 0.02%
Covered Lines: 11390
Relevant Lines: 11932

💛 - Coveralls
sadikovi commented 6 years ago

I think it might be better to have two level encoders and decoders for data page v1 and data page v2, it’s not that much of a code duplication and it makes it clear what functionality is supported for each.

sunchao commented 6 years ago

Thanks @sadikovi ! Yeah I'm not sure what's the most elegant way to handle this discrepancy between page v1 and v2. IMO we can just go with this approach and leave potential refactoring as a future work.

For reference, parquet-cpp doesn't seem to handle page v2. On the other hand, parquet-mr doesn't have a level encoder/decoder. Instead it (taking decoding path as example, encoding should be symmetric):

  1. if it's v1, initiate value reader based on encoding (RLE or BIT-PACKED), and the RLE (in our case RleValueDecoder) will handle the 4-byte length. In our case, we don't have a BIT-PACKED value reader since this is only used for decoding plain boolean values and our plain decoder already does that. BTW: I think #61 is not an issue anymore.

  2. if it's v2, initiate a "raw" RLE value decoder, and just pass the definition/repetition levels byte stream to it (unlike our page structure, parquet-mr has separate byte streams for def/rep levels for page v2).

IMO I don't see a clear advantage in either approach.

sadikovi commented 6 years ago

Let me try separating v1 from v2 and see if it makes things simpler.

sadikovi commented 6 years ago

@sunchao I updated the code to see if it is better to introduce RLE_V2 for Data Page v2, also did some minor refactoring. Let me know if this approach is better, or if I should revert to the previous approach. Thanks!

sadikovi commented 6 years ago

Thanks @sunchao. I will address your comments shortly and will also update the documentation for differences between v1 and v2, etc.

sadikovi commented 6 years ago

@sunchao I addressed your comments. Can you have a look again? Thanks!

sunchao commented 6 years ago

Looks good @sadikovi . Will merge shortly after the CI.

sadikovi commented 6 years ago

Thanks! On Tue, 7 Aug 2018 at 11:30 PM, Chao Sun notifications@github.com wrote:

Looks good @sadikovi https://github.com/sadikovi . Will merge shortly after the CI.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/141#issuecomment-411209075, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3qp1BTedDY0Gd0zrmAYdldnWDFOuks5uOgbogaJpZM4VvUPG .

sadikovi commented 6 years ago

Thanks for merging!