Closed: GregoryKimball closed this issue 7 months ago
@etseidl Do you think this would be useful... or is it a waste of bytes?
I took a quick look at the parquet-format site to see why it was added. It seems knowing that all pages are dictionary encoded helps with predicate pushdown. I assume you can just decode the dictionary page and eliminate entire column chunks if a filtering condition has no matches. Sounds like a good thing to add.
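The pruning idea described here can be sketched in a few lines. This is a minimal illustration only, assuming the per-page stats are already available as plain (page_type, encoding, count) tuples mirroring the PageEncodingStats struct in parquet.thrift; all_data_pages_dictionary_encoded and can_skip_chunk are hypothetical helpers, not cuDF or Arrow APIs.

```python
# Hypothetical sketch of dictionary-based chunk pruning using encoding_stats.
# Page type and encoding names match the Parquet spec; the tuple layout is an
# assumption for illustration, not a real reader API.

DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}

def all_data_pages_dictionary_encoded(encoding_stats):
    """encoding_stats: iterable of (page_type, encoding, count) tuples."""
    data_pages = [s for s in encoding_stats
                  if s[0] in ("DATA_PAGE", "DATA_PAGE_V2")]
    return bool(data_pages) and all(enc in DICTIONARY_ENCODINGS
                                    for _, enc, _ in data_pages)

def can_skip_chunk(encoding_stats, dictionary_values, predicate):
    # Only safe when every data page draws its values from the dictionary;
    # otherwise PLAIN-encoded pages could hold values the dictionary lacks.
    if not all_data_pages_dictionary_encoded(encoding_stats):
        return False
    # If no dictionary value satisfies the predicate, nothing in the chunk can.
    return not any(predicate(value) for value in dictionary_values)

# Example: a chunk whose 8 data pages are all RLE_DICTIONARY encoded can be
# skipped when the filter value is absent from its dictionary.
stats = [("DICTIONARY_PAGE", "PLAIN", 1), ("DATA_PAGE", "RLE_DICTIONARY", 8)]
print(can_skip_chunk(stats, ["apple", "pear"], lambda v: v == "kiwi"))  # True
```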
Is your feature request related to a problem? Please describe.
The parquet-cpp-arrow writer includes ColumnChunk encoding_stats after the ColumnChunk statistics in the Parquet file footer. The encoding stats are useful for providing a total page count, tracking RLE_DICTIONARY fallback to PLAIN encoding, and verifying optional V2 encodings such as DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY.
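As a rough illustration of those uses, the sketch below derives a total data-page count, flags a dictionary-to-PLAIN fallback, and checks for the delta encodings, given a list of (page_type, encoding, count) entries. The summarize_encoding_stats helper and its tuple input layout are assumptions made for this example, not part of any real writer or reader API.

```python
# Illustrative only: summarize a column chunk's encoding_stats entries.
# Each entry mirrors PageEncodingStats from parquet.thrift
# (page_type, encoding, count); the tuple form here is an assumption.

def summarize_encoding_stats(encoding_stats):
    data_pages = [(enc, n) for (ptype, enc, n) in encoding_stats
                  if ptype in ("DATA_PAGE", "DATA_PAGE_V2")]
    encodings_used = {enc for enc, _ in data_pages}
    return {
        "total_data_pages": sum(n for _, n in data_pages),
        # A mix of dictionary-encoded and PLAIN data pages suggests the writer
        # fell back to PLAIN after the dictionary grew too large.
        "dictionary_fallback": "PLAIN" in encodings_used and bool(
            encodings_used & {"RLE_DICTIONARY", "PLAIN_DICTIONARY"}),
        # Presence of the optional V2 delta encodings mentioned above.
        "uses_delta_encodings": bool(encodings_used & {
            "DELTA_BYTE_ARRAY", "DELTA_LENGTH_BYTE_ARRAY"}),
    }

stats = [("DICTIONARY_PAGE", "PLAIN", 1),
         ("DATA_PAGE", "RLE_DICTIONARY", 10),
         ("DATA_PAGE", "PLAIN", 2)]
print(summarize_encoding_stats(stats))
# {'total_data_pages': 12, 'dictionary_fallback': True,
#  'uses_delta_encodings': False}
```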
Parquet-tools is a simple command line interface to learn more about a Parquet file.
Here is an example of the encoding_stats data from the parquet-cpp-arrow writer, version 14.0.2:
parquet-tools inspect --detail cpp-arrow.pq
parquet-tools inspect --detail cudf.pq