Closed · sadikovi closed this 6 years ago
@sunchao Could you review this PR when you have time? Thanks!
@sunchao I updated the code. Sorry, I should have given a clear explanation in the PR description. Here it is (hope it helps with review as well).
This work can be split into two parts:
The first part lets us know when we need to fall back to another encoding, once the dictionary size or the number of entries (whichever comes first) exceeds a certain configurable threshold. Once either of them exceeds its limit, we write the dictionary page and a batch of data pages encoded with the dictionary.
Then we switch to the fallback encoding (theoretically configurable, but I have only seen PLAIN used, though I might have missed something) and write the rest of the column data in data pages.
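The fallback decision described above could be sketched roughly like this; the struct, field names, and threshold values here are illustrative assumptions, not the crate's actual API:

```rust
// Hypothetical sketch of the dictionary fallback check. The limits would be
// configurable in practice; the constants below are placeholder values.
const DICT_SIZE_LIMIT: usize = 1024 * 1024; // assumed byte limit for dictionary
const DICT_ENTRIES_LIMIT: usize = 65_536;   // assumed limit on distinct entries

struct DictEncoder {
    dict_encoded_size: usize, // bytes used by dictionary entries so far
    num_entries: usize,       // number of distinct dictionary entries
}

impl DictEncoder {
    /// Returns true when either limit is exceeded, meaning the writer should
    /// flush the dictionary page plus dictionary-encoded data pages, then
    /// continue with the fallback (PLAIN) encoding.
    fn should_fall_back(&self) -> bool {
        self.dict_encoded_size > DICT_SIZE_LIMIT || self.num_entries > DICT_ENTRIES_LIMIT
    }
}
```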
The second part is about knowing when we have enough data for a data page (this applies to all encodings, including the dictionary fallback). As we write data, we periodically invoke `estimated_data_encoded_size` to check if it exceeds the configurable data page size threshold. If so, we finalise the data page and prepare for another one. Unfortunately, we can only estimate the data page size, so it won't match the limit exactly. This is okay, because we write data in mini batches, so the data page size only gradually exceeds the threshold.
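The mini-batch loop described above could look roughly like this; the writer type, its fields, and the page-size constant are assumptions for illustration only:

```rust
// Illustrative sketch of periodically checking the estimated encoded size
// and finalising a data page when it crosses the configured threshold.
const DATA_PAGE_SIZE: usize = 1024; // assumed configurable page-size threshold

struct PageWriter {
    encoded_size: usize,    // estimated bytes encoded into the current page
    pages_finalized: usize, // number of data pages written so far
}

impl PageWriter {
    /// Stand-in for `estimated_data_encoded_size`: here we simply return a
    /// running counter; a real encoder computes or estimates this per encoding.
    fn estimated_data_encoded_size(&self) -> usize {
        self.encoded_size
    }

    /// Write one mini batch, then check whether the page should be finalised.
    /// The estimate may overshoot the limit slightly; since batches are small,
    /// the page only gradually exceeds the threshold.
    fn write_mini_batch(&mut self, batch_bytes: usize) {
        self.encoded_size += batch_bytes;
        if self.estimated_data_encoded_size() >= DATA_PAGE_SIZE {
            self.pages_finalized += 1;
            self.encoded_size = 0; // start a fresh data page
        }
    }
}
```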
This is what I could gather from parquet-cpp/parquet-mr.
Cool. Thanks for the explanation @sadikovi! The latest change looks good to me.
Merged. Thanks @sadikovi!
This PR adds the following:

- `estimated_data_encoded_size` method, which returns the approximate (actual, whenever possible) size in bytes of the data encoded so far. For dictionary encoding this is the size of the encoded value indices. This can be used to decide if there is enough data to create a new data page.
- `dict_encoded_size` method, which returns the size in bytes of the dictionary entries. This can be used to decide if the dictionary is full and we need to fall back to an alternative encoding (PLAIN).

The actual implementations are derived from parquet-cpp where there is an equivalent, checked against parquet-mr, and otherwise follow overall logical sense. I had to add a new helper trait `DictEncodedSize` to get the size in bytes of each value.

I added tests for `dict_encoded_size` and `estimated_data_encoded_size` for different encoders.
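A helper trait like the `DictEncodedSize` mentioned above could be sketched as follows; the method name and the per-type size rules here are assumptions based on the description (fixed-width physical types vs. a length-prefixed BYTE_ARRAY), not the actual trait definition:

```rust
// Hedged sketch of a per-value size helper trait for dictionary entries.
trait DictEncodedSize {
    /// Size in bytes this value occupies as a dictionary entry.
    fn dict_encoded_size(&self) -> usize;
}

impl DictEncodedSize for i32 {
    fn dict_encoded_size(&self) -> usize {
        4 // fixed-width physical type: always four bytes
    }
}

impl DictEncodedSize for &[u8] {
    fn dict_encoded_size(&self) -> usize {
        // Assumed: a 4-byte length prefix plus the raw bytes, as a
        // BYTE_ARRAY dictionary entry would be laid out in PLAIN encoding.
        4 + self.len()
    }
}
```

Summing `dict_encoded_size` over all distinct entries would then give the dictionary size used in the fallback check.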