Closed sadikovi closed 6 years ago
@sunchao I would appreciate if you could have a look at this pull request. Thanks!
@sunchao could you also have a look at this work? I made these changes since some of the files end up with bit packed level encoding. Thanks!
@sadikovi sure. Will take a look soon.
Should I add more tests? I feel like there are some cases that I have not covered.
The current tests look OK to me. The only things I can think of are:
- LevelEncoder.put(), when rle_encoder.put() or bit_packed_encoder.put_value() calls return false, therefore num_decoded != buffer.len(). This doesn't seem to be covered right now.
- LevelDecoder.get(), when buffer.len() > number of values in the decoder.

BTW: some documentation may be needed for the public functions of LevelDecoder/LevelEncoder. But I think we can address that in a separate PR.
Let me know if you have other thoughts :)
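For illustration, the first case above (the inner put call returning false, so fewer values are written than were passed in) could look roughly like the sketch below. This is not the parquet-rs code: SketchLevelEncoder, its fixed-size buffer, and the LSB-first bit order are all assumptions made for the example.

```rust
/// Hypothetical fixed-capacity bit-packed level encoder (not the parquet-rs API).
struct SketchLevelEncoder {
    bit_width: u32,
    buffer: Vec<u8>,   // pre-sized output buffer
    bit_offset: usize, // position of the next free bit
}

impl SketchLevelEncoder {
    fn new(bit_width: u32, capacity_bytes: usize) -> Self {
        SketchLevelEncoder {
            bit_width,
            buffer: vec![0u8; capacity_bytes],
            bit_offset: 0,
        }
    }

    /// Writes one value, LSB first (an assumption of this sketch).
    /// Returns false when the buffer has no room left.
    fn put_value(&mut self, value: u16) -> bool {
        let width = self.bit_width as usize;
        if self.bit_offset + width > self.buffer.len() * 8 {
            return false;
        }
        for i in 0..width {
            if (value >> i) & 1 == 1 {
                let pos = self.bit_offset + i;
                self.buffer[pos / 8] |= 1 << (pos % 8);
            }
        }
        self.bit_offset += width;
        true
    }

    /// Encodes levels until the buffer is full and returns how many were
    /// actually encoded, which may be smaller than levels.len().
    fn put(&mut self, levels: &[u16]) -> usize {
        let mut num_encoded = 0;
        for &level in levels {
            if !self.put_value(level) {
                break;
            }
            num_encoded += 1;
        }
        num_encoded
    }
}
```

A test for this case would pass more levels than the buffer can hold and assert that the returned count is smaller than the input length.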
Yes, you are right - I will add test cases. Also agree that we should document functions - will do that.
@sunchao I added tests and documentation. Would appreciate if you could review them when you have time.
I left some inline comments on the pull request; the overall changes are below:
- Added a num_values field to keep track of the total values left in the decoder. I found that RLE and bit packed can return more values than there actually are (the additional values are zeros); added a test for it.
- Changed set_data for data page v1 and set_data_range for data page v2 to take num_buffered_values as the total number of values that the decoder has.
- Added a check that set_data was called before getting values (initially it was only supported by RLE).

Honestly, these changes are getting larger and larger, and I am not very confident in my code changes. Could you help me double-check any performance implications, or suggest better handling of situations where we need to check that set is called before get, value increments, etc.? Thanks!
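To make the extra-zeros problem concrete: three 1-bit levels still occupy a whole byte, so a decoder that only looks at the bytes would hand back eight values, five of them padding. Below is a minimal sketch of capping reads with a num_values counter and rejecting get() before set_data(). The type SketchBitPackedDecoder, the String error, and the LSB-first bit order are assumptions for the example, not the code in this PR.

```rust
/// Hypothetical bit-packed level decoder (not the parquet-rs API).
struct SketchBitPackedDecoder {
    bit_width: u32,
    data: Vec<u8>,
    bit_offset: usize,
    /// Values left to read; None until set_data() has been called.
    num_values: Option<usize>,
}

impl SketchBitPackedDecoder {
    fn new(max_level: u16) -> Self {
        // Number of bits needed to represent max_level (0 bits for level 0).
        let bit_width = 16 - max_level.leading_zeros();
        SketchBitPackedDecoder {
            bit_width,
            data: Vec::new(),
            bit_offset: 0,
            num_values: None,
        }
    }

    fn set_data(&mut self, num_buffered_values: usize, data: Vec<u8>) {
        self.data = data;
        self.bit_offset = 0;
        self.num_values = Some(num_buffered_values);
    }

    /// Decodes up to buffer.len() values, but never more than num_values,
    /// so padding bits in the last byte are not returned as levels.
    fn get(&mut self, buffer: &mut [u16]) -> Result<usize, String> {
        let remaining = self
            .num_values
            .ok_or_else(|| "set_data() must be called before get()".to_string())?;
        let width = self.bit_width as usize;
        let to_read = buffer.len().min(remaining);
        let mut num_decoded = 0;
        for slot in buffer.iter_mut().take(to_read) {
            if self.bit_offset + width > self.data.len() * 8 {
                break; // ran out of bytes
            }
            let mut value = 0u16;
            for i in 0..width {
                let pos = self.bit_offset + i;
                let bit = (self.data[pos / 8] >> (pos % 8)) & 1;
                value |= (bit as u16) << i;
            }
            *slot = value;
            self.bit_offset += width;
            num_decoded += 1;
        }
        self.num_values = Some(remaining - num_decoded);
        Ok(num_decoded)
    }
}
```

With one data byte and num_buffered_values set to 3, get() on an 8-element buffer fills only the first three slots in this sketch, and a second call returns zero values.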
I will run manual tests on several files to make sure I can still read them with my changes:).
@sunchao thanks for the review! I will address comments.
@sunchao I updated the code, looks better now. I found that it fails when reading one of the examples I have, will try fixing it. Meanwhile, could you have a look, when you have time? Thanks!
@sunchao I updated the function documentation - should be okay now. Also, the issue was related to repetition levels, which is fixed in master; after a rebase all my example files worked correctly.
@sunchao Should we merge this PR?:)
Merged. Thanks @sadikovi for the awesome work!
@sunchao thank you for reviews and merge!
This PR adds support for a BIT_PACKED level encoder/decoder alongside the RLE level encoder/decoder. In parquet-mr, the bit packed level decoder/encoder is treated as dev/null and is used mainly as a placeholder for repetition level 0 (the parquet-mr dev/null encoder). I may be wrong about the dev/null observation, but I found this behaviour in several files: when repetition level > 0, RLE is used, otherwise BIT_PACKED.
This implementation is working (not a dummy like in parquet-mr) and allows writing and reading levels with bit packing, similar to parquet-cpp.
Implementation notes:
- LevelDecoder and LevelEncoder are now enums for RLE and BIT_PACKED
- set_data() method now takes num_buffered_values to calculate max length to decode for bit packed decoder, also value is used to keep track of values left
- set_data_range() method now takes num_buffered_values to keep track of total values left
- get() method is implemented as simple for loop, I added TODO to implement get_batch()
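As a rough sketch of the enum shape and the max-length calculation in the notes above: the bit packed variant has no length prefix, so the number of bytes to take has to be derived from num_buffered_values and the bit width. The names SketchLevelDecoder, Rle, BitPacked and data() are hypothetical, and the sketch assumes the RLE level section starts with a 4-byte little-endian length, as in data page v1. It only slices out the level bytes and does no actual decoding.

```rust
use std::cmp;

/// Hypothetical enum mirroring the RLE / BIT_PACKED split (not the parquet-rs types).
enum SketchLevelDecoder {
    Rle { data: Vec<u8> },
    BitPacked { bit_width: usize, data: Vec<u8> },
}

impl SketchLevelDecoder {
    /// Takes the level bytes from the start of `page` and returns how many
    /// bytes of the page the level section occupies.
    fn set_data(&mut self, num_buffered_values: usize, page: &[u8]) -> usize {
        match self {
            SketchLevelDecoder::Rle { data } => {
                // Assumption: 4-byte little-endian length prefix, then the runs.
                if page.len() < 4 {
                    data.clear();
                    return 0;
                }
                let len = u32::from_le_bytes([page[0], page[1], page[2], page[3]]) as usize;
                let end = cmp::min(4 + len, page.len());
                *data = page[4..end].to_vec();
                end
            }
            SketchLevelDecoder::BitPacked { bit_width, data } => {
                // No prefix: max length is ceil(num_values * bit_width / 8),
                // clamped to the bytes that are actually available.
                let num_bytes = (num_buffered_values * *bit_width + 7) / 8;
                let end = cmp::min(num_bytes, page.len());
                *data = page[..end].to_vec();
                end
            }
        }
    }

    /// Level bytes selected by the last set_data() call.
    fn data(&self) -> &[u8] {
        match self {
            SketchLevelDecoder::Rle { data } => data.as_slice(),
            SketchLevelDecoder::BitPacked { data, .. } => data.as_slice(),
        }
    }
}
```

For example, 10 values at bit width 1 take 2 bytes in the bit packed case, whatever follows them in the page.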
I added unit-tests for level encoding/decoding. I could not test bit packed level encoding/decoding when max level is greater than 0 on actual files (they are all 0 when it is bit packed).