manojkarthick / pqrs

Command line tool for inspecting Parquet files
Apache License 2.0

Feature request: Compression algorithm information #40

Closed Mickael-van-der-Beek closed 1 year ago

Mickael-van-der-Beek commented 1 year ago

Hello Manoj,

Very useful tool you have built!

One feature I would like to suggest is to display which compression algorithm was used on each column. Currently, it is possible to infer that compression was used from the difference between the "total compressed size" and "total uncompressed size" fields, but the actual algorithm used doesn't seem to be displayed.

So the idea would be to show "GZIP", "LZO", "ZSTD", "Brotli", etc. in the schema description table.

SteveLauC commented 1 year ago

Hi guys, I am interested in implementing this feature. Here is some draft code that prints the compression algorithm used for every column:

use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() {
    let file = File::open("parquet/1.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let mut column_names = Vec::new();

    // read the column names from the first row
    let first_row = reader.into_iter().next().expect("expected at least 1 row");
    let first_row_column_iter = first_row.get_column_iter();
    first_row_column_iter.for_each(|(name, _)| column_names.push(name.to_string()));

    // `into_iter()` consumed the first reader, so open the file again
    // to walk the column chunk metadata of every row group
    let file = File::open("parquet/1.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let file_meta = reader.metadata();
    for (idx, row_group) in file_meta.row_groups().iter().enumerate() {
        println!("Row Group {}", idx);
        for (col_idx, column) in row_group.columns().iter().enumerate() {
            println!("\t{}: {}", column_names[col_idx], column.compression());
        }
    }
}

$ pqrs cat parquet/1.parquet

#######################
File: parquet/1.parquet
#######################

{age: 18, name: "steve", timestamp: 0}

$ cargo r -q
Row Group 0
        age: UNCOMPRESSED
        name: UNCOMPRESSED
        timestamp: UNCOMPRESSED

If there are a lot of row groups in a parquet file, the output of the above program would be:

Row Group 0:
    xxx: XXX
    xxx: XXX
Row Group 1:
    xxx: XXX
    xxx: XXX
Row Group 2:
    xxx: XXX
    xxx: XXX
...

which is kinda messy, so I am curious what output format would suit this subcommand. Friendly ping @manojkarthick, any ideas?

manojkarthick commented 1 year ago

Thanks for looking into this @SteveLauC - I think it would be best to include the compression algorithm used at a column level in the pqrs schema --detailed command's output.

Sample output of pqrs schema --detailed:

column 0:
--------------------------------------------------------------------------------
column type: INT64
column path: "epochTime"
encodings: PLAIN BIT_PACKED
file path: N/A
file offset: 4
num of values: 1
total compressed size (in bytes): 71
total uncompressed size (in bytes): 69
data page offset: 4
index page offset: N/A
dictionary page offset: N/A
statistics: {min: 1672531499, max: 1672531499, distinct_count: N/A, null_count: 0, min_max_deprecated: false}

I think it would be great to add the compression information as another field in this list alongside the "total compressed size".
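For illustration, here is a minimal sketch of how that field could be produced next to the size fields, using only the parquet crate's ColumnChunkMetaData accessors (print_sizes_and_compression is a hypothetical helper for this sketch, not pqrs's actual code):

use parquet::file::metadata::ColumnChunkMetaData;

// Hypothetical helper: print the two size fields from the detailed
// output above, plus the proposed compression field. Compression
// implements Display, so this prints e.g. "UNCOMPRESSED" or "SNAPPY".
fn print_sizes_and_compression(column: &ColumnChunkMetaData) {
    println!("total compressed size (in bytes): {}", column.compressed_size());
    println!("total uncompressed size (in bytes): {}", column.uncompressed_size());
    println!("compression: {}", column.compression());
}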

SteveLauC commented 1 year ago

I think it would be best to include the compression algorithm used at a column level in the pqrs schema --detailed command's output.

That would be great, I will work on it then :)


Just took a look at the source code of pqrs schema, and it seems that we are using print_column_chunk_metadata() from parquet to print the metadata, so I guess I need to add a patch to parquet first then.
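In the meantime, a possible stopgap on the pqrs side would be to call parquet's printer and then append the field ourselves. A sketch, assuming print_column_chunk_metadata keeps its (&mut dyn io::Write, &ColumnChunkMetaData) shape:

use parquet::file::metadata::ColumnChunkMetaData;
use parquet::schema::printer::print_column_chunk_metadata;
use std::io::{self, Write};

// Sketch of a workaround: let parquet print the existing fields, then
// append the compression line until the upstream patch ships in a release.
fn print_column_with_compression(column: &ColumnChunkMetaData) {
    let mut out = io::stdout();
    print_column_chunk_metadata(&mut out, column);
    writeln!(out, "compression: {}", column.compression()).unwrap();
}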

SteveLauC commented 1 year ago

Hi @manojkarthick, would you like to get #41 merged first so that we don't need to tackle dependency compatibility problems when implementing the PR for this issue? :)

manojkarthick commented 1 year ago

Hi @manojkarthick, would you like to get #41 merged first so that we don't need to tackle dependency compatibility problems when implementing the PR for this issue? :)

I've merged #41, let me know if you need anything else (:

SteveLauC commented 1 year ago

I originally thought parquet#4176 could be included in release 39.0.0, but I was wrong, so we have to wait for the release of 40.0.0 :(

manojkarthick commented 1 year ago

I originally thought parquet#4176 could be included in release 39.0.0, but I was wrong, so we have to wait for the release of 40.0.0 :(

@SteveLauC Maybe you want to point the parquet dependency in Cargo.toml at the git revision that includes this change and raise a PR while we wait for the release? The dependency could be updated to 40.0.0 whenever that gets published.
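For reference, pinning a git revision in Cargo.toml looks roughly like this; the parquet crate lives in the apache/arrow-rs repository, and the rev value below is a placeholder, not the actual commit for parquet#4176:

# Cargo.toml: temporarily track a git revision instead of a crates.io
# release; switch back to parquet = "40.0.0" once it is published.
[dependencies]
parquet = { git = "https://github.com/apache/arrow-rs", rev = "<commit-sha>" }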