sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Add statistics to parquet-schema command display #158

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR adds column statistics to the output of the parquet-schema command when it is run with additional information display enabled (parquet-schema /path/to/file true). Some example output is below:

column 0:
--------------------------------------------------------------------------------
column type: INT32
column path: "b_struct.b_c_int"
...
statistics: {min: N/A, max: N/A, distinct_count: N/A, null_count: 8, 
min_max_deprecated: true}

column 0:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "first"
...
statistics: {min: [97], max: [97], distinct_count: N/A, null_count: 0, 
min_max_deprecated: true}

column 10:
--------------------------------------------------------------------------------
column type: INT96
column path: "timestamp_col"
...
statistics: N/A

Unfortunately, the statistics string can be a bit long, for example:

column 1:
--------------------------------------------------------------------------------
column type: DOUBLE
column path: "bp2"
...
dictionary page offset: 10788
statistics: {min: 69.05000000000001, max: 70.60000000000001, distinct_count: N/A, 
null_count: 0, min_max_deprecated: false}
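
As a rough illustration of how a statistics string of this shape could be produced, here is a minimal sketch. The `ColumnStats` struct and the `or_na` and `format_stats` helpers are hypothetical names made up for this example; they are not the types or functions used in parquet-rs or in this PR.

```rust
use std::fmt::Display;

/// Hypothetical, simplified stand-in for per-column statistics
/// (not the actual parquet-rs type).
struct ColumnStats<T> {
    min: Option<T>,
    max: Option<T>,
    distinct_count: Option<u64>,
    null_count: u64,
    min_max_deprecated: bool,
}

/// Renders an optional value, falling back to "N/A" when it is absent.
fn or_na<T: Display>(value: &Option<T>) -> String {
    match value {
        Some(v) => v.to_string(),
        None => "N/A".to_string(),
    }
}

/// Formats statistics in the style of the output above.
/// `None` means the column carries no statistics at all.
fn format_stats<T: Display>(stats: Option<&ColumnStats<T>>) -> String {
    match stats {
        None => "N/A".to_string(),
        Some(s) => format!(
            "{{min: {}, max: {}, distinct_count: {}, null_count: {}, min_max_deprecated: {}}}",
            or_na(&s.min),
            or_na(&s.max),
            or_na(&s.distinct_count),
            s.null_count,
            s.min_max_deprecated
        ),
    }
}

fn main() {
    // Mirrors the first INT32 example: no min/max, eight nulls.
    let stats = ColumnStats::<i32> {
        min: None,
        max: None,
        distinct_count: None,
        null_count: 8,
        min_max_deprecated: true,
    };
    println!("statistics: {}", format_stats(Some(&stats)));
    // Mirrors the INT96 example: no statistics at all.
    println!("statistics: {}", format_stats::<i32>(None));
}
```

Keeping the "N/A" fallback in one helper means every missing field is rendered the same way; the sketch emits the whole string on one line and leaves any wrapping to the caller.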

Closes #156

sadikovi commented 6 years ago

@sunchao could you review this PR? Let me know if I should make any changes or add more tests. Thanks!

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 617


Files with Coverage Reduction    New Missed Lines    %
encodings/encoding.rs            1                   94.77%
schema/printer.rs                6                   70.52%
file/statistics.rs               10                  92.71%

Totals
Change from base Build 613:      -0.06%
Covered Lines:                   12414
Relevant Lines:                  12993

sunchao commented 6 years ago

Looks good. I'm wondering if it's better to make this optional, i.e., only print it when an extra argument such as -statistics is specified. Thoughts?

sadikovi commented 6 years ago

That is a good idea, though we would then have to support both the true and -statistics flags and make sure they work together. My understanding was that if the user provides true, we show everything.

Slightly off-topic: I was wondering whether we should improve our CLI tools, for example with better parameter handling and a help display like cargo --help, plus different modes for parquet-schema: --short for just the schema, --ext for a bunch of other properties, and --all for everything.

I can either add just an extra option for statistics or extend the tool to support the different options I mentioned above. Let me know which you think is better - I am happy to change it either way!
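
To make the proposed modes concrete, here is a minimal sketch of how the arguments could be parsed using only the standard library. The `Mode` enum and `parse_args` function are hypothetical, and treating the existing positional `true` as equivalent to `--all` is an assumption made for this example, not something the tool does today.

```rust
/// Hypothetical display modes for parquet-schema, following the proposal above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode {
    Short, // just the schema
    Ext,   // schema plus other properties
    All,   // everything, including statistics
}

/// Parses the arguments that follow the program name into a file path and a mode.
/// Assumption: the existing positional `true` is treated like `--all`,
/// and the schema-only mode is the default.
fn parse_args(args: &[String]) -> Result<(String, Mode), String> {
    let mut path: Option<String> = None;
    let mut mode = Mode::Short;
    for arg in args {
        match arg.as_str() {
            "--short" => mode = Mode::Short,
            "--ext" => mode = Mode::Ext,
            "--all" | "true" => mode = Mode::All,
            flag if flag.starts_with("--") => {
                return Err(format!("unknown option: {}", flag));
            }
            other => {
                // Only one positional file path is accepted.
                if path.is_some() {
                    return Err(format!("unexpected argument: {}", other));
                }
                path = Some(other.to_string());
            }
        }
    }
    path.map(|p| (p, mode))
        .ok_or_else(|| "missing file path".to_string())
}

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    match parse_args(&args) {
        Ok((path, mode)) => println!("file: {}, mode: {:?}", path, mode),
        Err(e) => {
            eprintln!("error: {}", e);
            eprintln!("usage: parquet-schema <file-path> [--short | --ext | --all]");
        }
    }
}
```

With this shape, parquet-schema /path/to/file true and parquet-schema /path/to/file --all would select the same mode, so the current invocation would keep working while the new flags are introduced.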

sunchao commented 6 years ago

Agree that the CLI tool needs improvement - I was also thinking that we could enable colored output and make the output better formatted, etc.

I think we can just re-use the true option for everything now, and leave the aforementioned improvements as a followup. :)