Closed · sadikovi closed this 6 years ago
@sunchao Could you review this PR when you have time? Thanks!
@sunchao I updated the code. Sorry, I should have given a clear explanation in the PR description. Here it is (hope it helps with review as well).
This work can be split into two parts:
The first part lets us know when we need to fall back to another encoding, once the dictionary size or the number of entries (whichever comes first) exceeds a certain configurable threshold. Once either of them exceeds its limit, we write the dictionary page and a batch of data pages encoded with the dictionary.
Then we switch to the fallback encoding (theoretically configurable, but I have only seen PLAIN used, though I might have missed something) and write the rest of the column data in data pages.
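The fallback decision described above could be sketched roughly like this; the struct, field names, and threshold values here are illustrative assumptions, not the crate's actual API:

```rust
// Hypothetical sketch of the dictionary fallback check. The limits would be
// configurable in practice; the constants below are placeholder values.
const DICT_SIZE_LIMIT: usize = 1024 * 1024; // assumed byte limit for dictionary
const DICT_ENTRIES_LIMIT: usize = 65_536;   // assumed limit on distinct entries

struct DictEncoder {
    dict_encoded_size: usize, // bytes used by dictionary entries so far
    num_entries: usize,       // number of distinct dictionary entries
}

impl DictEncoder {
    /// Returns true when either limit is exceeded, meaning the writer should
    /// flush the dictionary page plus dictionary-encoded data pages, then
    /// continue with the fallback (PLAIN) encoding.
    fn should_fall_back(&self) -> bool {
        self.dict_encoded_size > DICT_SIZE_LIMIT || self.num_entries > DICT_ENTRIES_LIMIT
    }
}
```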
The second part is about knowing when we have enough data for a data page (this applies to all encodings, including the dictionary fallback). As we write data, we periodically invoke `estimated_data_encoded_size` to check if it exceeds the configurable data page size threshold. If so, we finalise the data page and prepare for another one. Unfortunately, we can only estimate the data page size, so it won't match the limit exactly. This is okay, because we write data in mini batches, so the data page size only gradually exceeds the threshold.
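The mini-batch loop described above could look roughly like this; the writer type, its fields, and the page-size constant are assumptions for illustration only:

```rust
// Illustrative sketch of periodically checking the estimated encoded size
// and finalising a data page when it crosses the configured threshold.
const DATA_PAGE_SIZE: usize = 1024; // assumed configurable page-size threshold

struct PageWriter {
    encoded_size: usize,    // estimated bytes encoded into the current page
    pages_finalized: usize, // number of data pages written so far
}

impl PageWriter {
    /// Stand-in for `estimated_data_encoded_size`: here we simply return a
    /// running counter; a real encoder computes or estimates this per encoding.
    fn estimated_data_encoded_size(&self) -> usize {
        self.encoded_size
    }

    /// Write one mini batch, then check whether the page should be finalised.
    /// The estimate may overshoot the limit slightly; since batches are small,
    /// the page only gradually exceeds the threshold.
    fn write_mini_batch(&mut self, batch_bytes: usize) {
        self.encoded_size += batch_bytes;
        if self.estimated_data_encoded_size() >= DATA_PAGE_SIZE {
            self.pages_finalized += 1;
            self.encoded_size = 0; // start a fresh data page
        }
    }
}
```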
This is what I could gather from parquet-cpp/parquet-mr.
Cool. Thanks for the explanation @sadikovi! The latest change looks good to me.
Merged. Thanks @sadikovi!
This PR adds the following:

- `estimated_data_encoded_size` method, which returns the approximate (actual, whenever possible) size in bytes of the data encoded so far. For dictionary encoding this is the size of the encoded value indices. This can be used to decide if there is enough data to create a new data page.
- `dict_encoded_size` method, which returns the size in bytes of the dictionary entries. This can be used to decide if the dictionary is full and we need to fall back to an alternative encoding (PLAIN).

The actual implementations are derived from parquet-cpp where there is an equivalent, checked against parquet-mr, and otherwise follow overall logical sense. I had to add a new helper trait `DictEncodedSize` to get the size in bytes of each value.

I added tests for `dict_encoded_size` and `estimated_data_encoded_size` for different encoders.
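A helper trait like the `DictEncodedSize` mentioned above could be sketched as follows; the method name and the per-type size rules here are assumptions based on the description (fixed-width physical types vs. a length-prefixed BYTE_ARRAY), not the actual trait definition:

```rust
// Hedged sketch of a per-value size helper trait for dictionary entries.
trait DictEncodedSize {
    /// Size in bytes this value occupies as a dictionary entry.
    fn dict_encoded_size(&self) -> usize;
}

impl DictEncodedSize for i32 {
    fn dict_encoded_size(&self) -> usize {
        4 // fixed-width physical type: always four bytes
    }
}

impl DictEncodedSize for &[u8] {
    fn dict_encoded_size(&self) -> usize {
        // Assumed: a 4-byte length prefix plus the raw bytes, as a
        // BYTE_ARRAY dictionary entry would be laid out in PLAIN encoding.
        4 + self.len()
    }
}
```

Summing `dict_encoded_size` over all distinct entries would then give the dictionary size used in the fallback check.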