Closed sadikovi closed 5 years ago
I think we should create a follow up ticket to handle default encodings depending on column type.
@sadikovi can you update the PR with rustfmt?
I'd like to discuss the approach:
Let me know if you have any other suggestions or objections to this approach. Thanks!
@sadikovi
Result
seems not nice too and it's also not quite compatible with how properties
are implemented at the moment.I noticed another issue though - it seems we use PLAIN_ENCODING
as the default fallback encoding for all encoding types, which may not be the best choice. We may want to use different encodings based on types and parquet version (i.e., v1 or v2). This is how parquet-mr is doing.
Yes, I was thinking about it. I can add it as follow up work. On Thu, 29 Nov 2018 at 7:41 AM, Chao Sun notifications@github.com wrote:
@sadikovi https://github.com/sadikovi
- the added trait looks fine to me. I think it clearly expresses the idea that we want to handle boolean type specially. Replacing it with a function sounds good to me as well but I don't have strong preference on either.
- I'm fine with the silent fallback. We should definitely not panic. Returning Result seems not nice too and it's also not quite compatible with how properties are implemented at the moment.
I noticed another issue though - it seems we use PLAIN_ENCODING as the default fallback encoding for all encoding types, which may not be the best choice. We may want to use different encodings based on types and parquet version (i.e., v1 or v2). This is how parquet-mr is doing.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/196#issuecomment-442723817, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3kkJGRGE2UDmlphs7inQ-zymZdKgks5uz4GmgaJpZM4Y4drt .
Actually, I can do all of that here, let me know if I should include default encodings depending on a column type work in this PR.
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
file/properties.rs | 16 | 91.04% | ||
column/writer.rs | 18 | 97.6% | ||
<!-- | Total: | 34 | --> |
Totals | |
---|---|
Change from base Build 670: | 0.09% |
Covered Lines: | 13196 |
Relevant Lines: | 13777 |
I think it may be better to open a new PR since these two are not so closely related.
Thanks. I looked at the changes, and it is not difficult to fix as part of this PR, and the changes are relatively small. I can push commit and then we can decide if we need to open another one. Will that be okay?
Yes. Sounds good.
On Thu, Nov 29, 2018 at 12:24 AM Ivan notifications@github.com wrote:
Thanks. I looked at the changes, and it is not difficult to fix as part of this PR, and the changes are relatively small. I can push commit and then we can decide if we need to open another one. Will that be okay?
— You are receiving this because your review was requested. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/196#issuecomment-442746124, or mute the thread https://github.com/notifications/unsubscribe-auth/AAe7NxCjBnkD1x-I8dC0m2jikxLXJcVHks5uz5mhgaJpZM4Y4drt .
@sunchao I updated the PR with encoding write support. Can you have a look when you have time? Thanks!
Could you take a quick look at tests as well? I feel like they could be implemented better.
Thanks for merging!
This PR addresses the issue when we would enable dictionary encoding for boolean values regardless of the parquet format v1 or v2. This prevents file to be read by other systems that use parquet-mr.
In parquet-mr column writer use either PLAIN (v1) or RLE (v2) when writing for boolean values. In our library we would just use an alternative encoding that was set for a column, which defaults to PLAIN.
This PR changes the behaviour. Now if no encoding explicitly set for a column, we will use EncodingWriteSupport trait to check if column type supports dictionary encoding and what alternative fallback encoding we can use. The values match parquet-mr implementation for both V1 and V2. It also changes API for WriterProperties - now
encoding(column)
returns None, if no encoding is set (used to return default PLAIN encoding).Closes #195