sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Add BIT_PACKED level encoder/decoder #25

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR adds support for BIT_PACKED level encoder/decoder alongside RLE level encoder/decoder. Bit packed level decoder/encoder is treated as dev/null, and it is used mainly as placeholder for repetition level 0 -> parquet-mr dev/null encoder. I may be wrong with dev/null observation, but found this behaviour on several files - then repetition level > 0 - RLE, otherwise BIT_PACKED.

This implementation is working (not dummy like in parquet-mr), and allows to write and read levels with bit packing, similar to parquet-cpp.

Implementation notes:

I added unit-tests for level encoding/decoding. I could not test bit packed level encoding/decoding when max level is greater than 0 on actual files (they are all 0 when it is bit packed).

sadikovi commented 6 years ago

@sunchao I would appreciate if you could have a look at this pull request. Thanks!

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.04%) to 88.645% when pulling 0e26dd434c6cb2cafdb7988c6e3fe8dbb9c161f3 on sadikovi:bit-packed-levels into 403e6f7ecd6b20e8026893b37f0171d99c7f2479 on sunchao:master.

sadikovi commented 6 years ago

@sunchao could you also have a look at this work? I made these changes since some of the files end up with bit packed level encoding. Thanks!

sunchao commented 6 years ago

@sadikovi sure. Will take a look soon.

sadikovi commented 6 years ago

Should I add more tests? I feel like there are some cases that I have not covered.

sunchao commented 6 years ago

Should I add more tests? I feel like there are some cases that I have not covered.

The current tests look OK for me. The only things I can think of are:

  1. in LevelEncoder.put(), when rle_encoder.put() or bit_packed_encoder.put_value() calls return false, therefore num_decoded != buffer.len(). This doesn't seem to be covered right now.
  2. similarly, in LevelDecoder.get(), when buffer.len() > number of values in the decoder.

BTW: some documents may be needed for the public functions of LevelDecoder/LevelEncoder. But I think we can address that in a separate PR.

Let me know if you have other thoughts :)

sadikovi commented 6 years ago

Yes, you are right - I will add test cases. Also agree that we should document functions - will do that.

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.3%) to 88.882% when pulling f8d5021c0e49150bfa3b835399eb09875db8b87e on sadikovi:bit-packed-levels into 403e6f7ecd6b20e8026893b37f0171d99c7f2479 on sunchao:master.

sadikovi commented 6 years ago

@sunchao I added tests and documentation. Would appreciate if you could review them when you have time.

I left some inline comments on pull request, and overall changes are below:

Honestly, these changes become larger and larger, and I am not very confident in my code changes. Could you help me to double check any performance implications or better handling of situations when we need to check if set is called before get, and values increment, etc.? Thanks!

I will run manual tests on several files to make sure I can still read them with my changes:).

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.3%) to 88.898% when pulling 1b2277934cad5e0581519960df7347db1271bd5e on sadikovi:bit-packed-levels into 403e6f7ecd6b20e8026893b37f0171d99c7f2479 on sunchao:master.

sadikovi commented 6 years ago

@sunchao thanks for the review! I will address comments.

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.3%) to 88.871% when pulling 281cfe11ae7b8d4100f06e4f20e8c3d51f8111fc on sadikovi:bit-packed-levels into 403e6f7ecd6b20e8026893b37f0171d99c7f2479 on sunchao:master.

sadikovi commented 6 years ago

@sunchao I updated the code, looks better now. I found that it fails when reading one of the examples I have, will try fixing it. Meanwhile, could you have a look, when you have time? Thanks!

sadikovi commented 6 years ago

@sunchao I updated function documentation - should be okay now. Also issue was related to repetition levels which is fixed in master, after rebase all my example files worked correctly.

sadikovi commented 6 years ago

@sunchao Should we merge this PR?:)

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.2%) to 88.871% when pulling 8ca0e62422c8c7ff7746ddf3250c07fc4ac4dedc on sadikovi:bit-packed-levels into 179e616b67c9763ccc4b183f8915166b9ea6ca59 on sunchao:master.

sunchao commented 6 years ago

Merged. Thanks @sadikovi for the awesome work!

sadikovi commented 6 years ago

@sunchao thank you for reviews and merge!