sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Add EncodingWriteSupport for column writer #196

Closed sadikovi closed 5 years ago

sadikovi commented 5 years ago

This PR addresses the issue when we would enable dictionary encoding for boolean values regardless of the parquet format v1 or v2. This prevents file to be read by other systems that use parquet-mr.

In parquet-mr column writer use either PLAIN (v1) or RLE (v2) when writing for boolean values. In our library we would just use an alternative encoding that was set for a column, which defaults to PLAIN.

This PR changes the behaviour. Now if no encoding explicitly set for a column, we will use EncodingWriteSupport trait to check if column type supports dictionary encoding and what alternative fallback encoding we can use. The values match parquet-mr implementation for both V1 and V2. It also changes API for WriterProperties - now encoding(column) returns None, if no encoding is set (used to return default PLAIN encoding).

Closes #195

sadikovi commented 5 years ago

I think we should create a follow up ticket to handle default encodings depending on column type.

sunchao commented 5 years ago

@sadikovi can you update the PR with rustfmt?

sadikovi commented 5 years ago

I'd like to discuss the approach:

Let me know if you have any other suggestions or objections to this approach. Thanks!

sunchao commented 5 years ago

@sadikovi

  1. the added trait looks fine to me. I think it clearly expresses the idea that we want to handle boolean type specially. Replacing it with a function sounds good to me as well but I don't have strong preference on either.
  2. I'm fine with the silent fallback. We should definitely not panic. Returning Result seems not nice too and it's also not quite compatible with how properties are implemented at the moment.

I noticed another issue though - it seems we use PLAIN_ENCODING as the default fallback encoding for all encoding types, which may not be the best choice. We may want to use different encodings based on types and parquet version (i.e., v1 or v2). This is how parquet-mr is doing.

sadikovi commented 5 years ago

Yes, I was thinking about it. I can add it as follow up work. On Thu, 29 Nov 2018 at 7:41 AM, Chao Sun notifications@github.com wrote:

@sadikovi https://github.com/sadikovi

  1. the added trait looks fine to me. I think it clearly expresses the idea that we want to handle boolean type specially. Replacing it with a function sounds good to me as well but I don't have strong preference on either.
  2. I'm fine with the silent fallback. We should definitely not panic. Returning Result seems not nice too and it's also not quite compatible with how properties are implemented at the moment.

I noticed another issue though - it seems we use PLAIN_ENCODING as the default fallback encoding for all encoding types, which may not be the best choice. We may want to use different encodings based on types and parquet version (i.e., v1 or v2). This is how parquet-mr is doing.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/196#issuecomment-442723817, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3kkJGRGE2UDmlphs7inQ-zymZdKgks5uz4GmgaJpZM4Y4drt .

sadikovi commented 5 years ago

Actually, I can do all of that here, let me know if I should include default encodings depending on a column type work in this PR.

coveralls commented 5 years ago

Pull Request Test Coverage Report for Build 680


Files with Coverage Reduction New Missed Lines %
file/properties.rs 16 91.04%
column/writer.rs 18 97.6%
<!-- Total: 34 -->
Totals Coverage Status
Change from base Build 670: 0.09%
Covered Lines: 13196
Relevant Lines: 13777

💛 - Coveralls
sunchao commented 5 years ago

I think it may be better to open a new PR since these two are not so closely related.

sadikovi commented 5 years ago

Thanks. I looked at the changes, and it is not difficult to fix as part of this PR, and the changes are relatively small. I can push commit and then we can decide if we need to open another one. Will that be okay?

sunchao commented 5 years ago

Yes. Sounds good.

On Thu, Nov 29, 2018 at 12:24 AM Ivan notifications@github.com wrote:

Thanks. I looked at the changes, and it is not difficult to fix as part of this PR, and the changes are relatively small. I can push commit and then we can decide if we need to open another one. Will that be okay?

— You are receiving this because your review was requested. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/196#issuecomment-442746124, or mute the thread https://github.com/notifications/unsubscribe-auth/AAe7NxCjBnkD1x-I8dC0m2jikxLXJcVHks5uz5mhgaJpZM4Y4drt .

sadikovi commented 5 years ago

@sunchao I updated the PR with encoding write support. Can you have a look when you have time? Thanks!

sadikovi commented 5 years ago

Could you take a quick look at tests as well? I feel like they could be implemented better.

sadikovi commented 5 years ago

Thanks for merging!