manojkarthick / pqrs

Command line tool for inspecting Parquet files
Apache License 2.0

feat: compression info in schema subcommand #42

Closed SteveLauC closed 1 year ago

SteveLauC commented 1 year ago

What does this PR do

  1. Add compression information to the output of pqrs schema -D

    $ pqrs cat 1.parquet
    
    ###############
    File: 1.parquet
    ###############
    
    {age: 18, name: "steve", timestamp: 0}
    
    $ ./target/debug/pqrs schema 1.parquet -D
    ...
    column 0:
    --------------------------------------------------------------------------------
    column type: INT64
    column path: "age"
    encodings: PLAIN RLE
    file path: N/A
    file offset: 57
    num of values: 1
    compression: UNCOMPRESSED
    total compressed size (in bytes): 53
    total uncompressed size (in bytes): 53
    data page offset: 4
    index page offset: N/A
    dictionary page offset: N/A
    statistics: {min: 18, max: 18, distinct_count: N/A, null_count: 0, min_max_deprecated: false}
    bloom filter offset: N/A
    offset index offset: 423
    offset index length: 10
    column index offset: 336
    column index length: 31
    
    column 1:
    --------------------------------------------------------------------------------
    column type: BYTE_ARRAY
    column path: "name"
    encodings: PLAIN RLE
    file path: N/A
    file offset: 170
    num of values: 1
    compression: UNCOMPRESSED
    total compressed size (in bytes): 48
    total uncompressed size (in bytes): 48
    data page offset: 122
    index page offset: N/A
    dictionary page offset: N/A
    statistics: {min: [115, 116, 101, 118, 101], max: [115, 116, 101, 118, 101], distinct_count: N/A, null_count: 0, min_max_deprecated: false}
    bloom filter offset: N/A
    offset index offset: 433
    offset index length: 11
    column index offset: 367
    column index length: 25
    
    column 2:
    --------------------------------------------------------------------------------
    column type: INT64
    column path: "timestamp"
    encodings: PLAIN RLE
    file path: N/A
    file offset: 264
    num of values: 1
    compression: UNCOMPRESSED
    total compressed size (in bytes): 53
    total uncompressed size (in bytes): 53
    data page offset: 211
    index page offset: N/A
    dictionary page offset: N/A
    statistics: {min: 0, max: 0, distinct_count: N/A, null_count: 0, min_max_deprecated: false}
    bloom filter offset: N/A
    offset index offset: 444
    offset index length: 11
    column index offset: 392
    column index length: 31
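
In the output above, the statistics line for the BYTE_ARRAY column ("name") prints min and max as raw bytes. The sketch below shows one way to read them, assuming the bytes are UTF-8 text, as they are in this file. `render_stat_bytes` is a hypothetical helper written for illustration and is not part of pqrs or the parquet crate.

```rust
/// Hypothetical helper (not part of pqrs): render min/max statistic bytes
/// as quoted text when they are valid UTF-8, otherwise fall back to the
/// raw byte list, which is the form shown in the output above.
fn render_stat_bytes(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(text) => format!("{:?}", text),
        Err(_) => format!("{:?}", bytes),
    }
}

fn main() {
    // Bytes copied from the `statistics` line of column 1 ("name").
    let min = [115u8, 116, 101, 118, 101];
    println!("min: {}", render_stat_bytes(&min));
}
```

The decoded value matches the `name` field of the record printed by pqrs cat. The fallback branch matters because a BYTE_ARRAY column is not guaranteed to hold text.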

Closes #40

manojkarthick commented 1 year ago

Thanks a lot @SteveLauC - will try and get a release out tonight!

SteveLauC commented 1 year ago

> Thanks a lot @SteveLauC - will try and get a release out tonight!

That would be great, but as far as I remember, a crate with git dependencies cannot be published to crates.io, so I guess we still need to wait for the release of 40.0.0 :(

manojkarthick commented 1 year ago

That’s true yeah. I’ve released it using homebrew and also created prebuilt binaries for various platforms/architectures, so that’s something I suppose :)