sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Add dictionary size estimation and encoded data estimation for encoders #125

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR adds the following:

The actual implementations are derived from parquet-cpp where an equivalent exists, cross-checked against parquet-mr, or otherwise based on overall logical sense. I had to add a new helper trait DictEncodedSize to get the number of bytes for each value.

I added tests for dict_encoded_size and estimated_data_encoded_size for different encoders.
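As a rough illustration of what a per-value size helper like DictEncodedSize could look like (the trait name comes from the PR; the method signatures and impls below are hypothetical, not the actual parquet-rs API):

```rust
// Hypothetical sketch of a DictEncodedSize-style helper trait.
// The impls below are illustrative, not the actual parquet-rs code.
trait DictEncodedSize {
    /// Number of bytes this value occupies in the dictionary.
    fn dict_encoded_size(&self) -> usize;
}

impl DictEncodedSize for i32 {
    fn dict_encoded_size(&self) -> usize {
        std::mem::size_of::<i32>() // fixed 4 bytes per INT32 value
    }
}

impl DictEncodedSize for Vec<u8> {
    fn dict_encoded_size(&self) -> usize {
        // byte arrays carry a 4-byte length prefix plus the data itself
        4 + self.len()
    }
}

fn main() {
    let x: i32 = 42;
    let bytes: Vec<u8> = vec![1, 2, 3];
    println!("{}", x.dict_encoded_size());     // 4
    println!("{}", bytes.dict_encoded_size()); // 7
}
```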

sadikovi commented 6 years ago

@sunchao Could you review this PR when you have time? Thanks!

sadikovi commented 6 years ago

@sunchao I updated the code. Sorry, I should have given a clear explanation in the PR description. Here it is (hope it helps with review as well).

This work could be split into 2 parts:

The first part lets us know when we need to fall back to another encoding, once either the dictionary size or the number of entries (whichever comes first) exceeds a configurable threshold. Once either of them exceeds the limit, we write the dictionary page and the batch of data pages that were dictionary-encoded.

Then we switch to the fallback encoding (theoretically configurable, but I have only seen PLAIN used, though I might have missed something) and write the rest of the column data in data pages.
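The fallback decision described above can be sketched roughly as follows (struct and field names are made up for illustration; the real encoder tracks this state differently):

```rust
// Illustrative sketch of the "fall back when either limit is exceeded"
// check described above. Names and thresholds are hypothetical.
struct DictEncoder {
    num_entries: usize,     // distinct values seen so far
    dict_size_bytes: usize, // estimated dictionary size in bytes
}

impl DictEncoder {
    /// Fall back to the non-dictionary encoding once either the entry
    /// count or the dictionary byte size exceeds its configured limit,
    /// whichever comes first.
    fn should_fall_back(&self, max_entries: usize, max_dict_bytes: usize) -> bool {
        self.num_entries > max_entries || self.dict_size_bytes > max_dict_bytes
    }
}

fn main() {
    // Entry count and byte size are both under their limits: keep dictionary.
    let small = DictEncoder { num_entries: 10_000, dict_size_bytes: 900_000 };
    println!("{}", small.should_fall_back(100_000, 1_048_576)); // false

    // Byte size exceeds the 1 MiB limit: trigger the fallback.
    let big = DictEncoder { num_entries: 10_000, dict_size_bytes: 2_000_000 };
    println!("{}", big.should_fall_back(100_000, 1_048_576)); // true
}
```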

The second part is about knowing when we have enough data for a data page (this applies to all encodings, including the dictionary fallback). As we write data, we periodically invoke estimated_data_encoded_size to check if it exceeds the configurable data page size threshold. If so, we finalise the data page and prepare for another one. Unfortunately, we can only estimate the data page size, so it won't be exactly at the limit. This is okay, because we write data in mini batches, so the data page size increases gradually.
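A minimal sketch of that mini-batch page-flush loop, assuming a toy encoder that reports its buffered size via estimated_data_encoded_size (the method name comes from the PR; everything else here is illustrative):

```rust
// Toy encoder standing in for a real one; only tracks buffered bytes.
struct PlainEncoder {
    buffered_bytes: usize,
}

impl PlainEncoder {
    fn put(&mut self, value_size: usize) {
        self.buffered_bytes += value_size;
    }
    // In the real code this is an estimate; here it is exact.
    fn estimated_data_encoded_size(&self) -> usize {
        self.buffered_bytes
    }
}

/// Write values in mini batches, checking the size estimate after each
/// batch and "finalising" a data page whenever the limit is exceeded.
/// Returns the number of pages flushed.
fn write_column(values: &[i32], page_size_limit: usize) -> usize {
    let mut encoder = PlainEncoder { buffered_bytes: 0 };
    let mut pages = 0;
    for chunk in values.chunks(128) {
        for v in chunk {
            encoder.put(std::mem::size_of_val(v));
        }
        if encoder.estimated_data_encoded_size() >= page_size_limit {
            pages += 1; // finalise this data page and start a new one
            encoder.buffered_bytes = 0;
        }
    }
    if encoder.buffered_bytes > 0 {
        pages += 1; // flush the final partial page
    }
    pages
}

fn main() {
    let values = vec![0i32; 1000]; // 4000 bytes of INT32 data
    println!("{}", write_column(&values, 1024)); // prints 4
}
```

Because the check only runs between batches, a page can slightly overshoot the limit, which matches the "estimate, not exact" behaviour described above.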

This is what I could gather from parquet-cpp/parquet-mr.

sunchao commented 6 years ago

Cool. Thanks for the explanation @sadikovi ! The latest change looks good to me.

sunchao commented 6 years ago

Merged. Thanks @sadikovi !