Closed: Mickael-van-der-Beek closed this issue 1 year ago
Hi guys, I am interested in implementing this feature. Here is some draft code that prints the compression algorithm used for every column:
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() {
    // Read the column names from the first row of the file.
    let file = File::open("parquet/1.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let mut column_names = Vec::new();
    let first_row = reader.into_iter().next().expect("expected at least 1 row");
    first_row
        .get_column_iter()
        .for_each(|(name, _)| column_names.push(name.to_string()));

    // `into_iter()` consumed the first reader, so open the file again
    // to get at the file-level metadata.
    let file = File::open("parquet/1.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let file_meta = reader.metadata();

    // Print the compression codec of every column chunk, grouped by row group.
    for (idx, row_group) in file_meta.row_groups().iter().enumerate() {
        println!("Row Group {}", idx);
        for (idx, column) in row_group.columns().iter().enumerate() {
            println!("\t{}: {}", column_names[idx], column.compression());
        }
    }
}
$ pqrs cat parquet/1.parquet
#######################
File: parquet/1.parquet
#######################
{age: 18, name: "steve", timestamp: 0}
$ cargo r -q
Row Group 0
    age: UNCOMPRESSED
    name: UNCOMPRESSED
    timestamp: UNCOMPRESSED
If there are a lot of rows in a parquet file, the output of the above program would be:
Row Group 0:
    xxx: XXX
    xxx: XXX
Row Group 1:
    xxx: XXX
    xxx: XXX
Row Group 2:
    xxx: XXX
    xxx: XXX
...
which is kinda messy, so I am curious what output format would suit this subcommand. Friendly ping @manojkarthick, any idea?
Thanks for looking into this @SteveLauC - I think it would be best to include the compression algorithm used at a column level in the pqrs schema --detailed command's output.
Sample output of pqrs schema --detailed:
column 0:
--------------------------------------------------------------------------------
column type: INT64
column path: "epochTime"
encodings: PLAIN BIT_PACKED
file path: N/A
file offset: 4
num of values: 1
total compressed size (in bytes): 71
total uncompressed size (in bytes): 69
data page offset: 4
index page offset: N/A
dictionary page offset: N/A
statistics: {min: 1672531499, max: 1672531499, distinct_count: N/A, null_count: 0, min_max_deprecated: false}
I think it would be great to add the compression information as another field in this list alongside the "total compressed size".
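For illustration, here is a minimal sketch of reading that codec straight from the column-chunk metadata with the parquet crate (this is not pqrs's actual implementation, and the file path is just a placeholder):

use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; point this at any parquet file.
    let file = File::open("parquet/1.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let metadata = reader.metadata();

    for (rg_idx, row_group) in metadata.row_groups().iter().enumerate() {
        for (col_idx, column) in row_group.columns().iter().enumerate() {
            println!("row group {}, column {}:", rg_idx, col_idx);
            println!("  column path: {}", column.column_path());
            println!("  total compressed size (in bytes): {}", column.compressed_size());
            println!("  total uncompressed size (in bytes): {}", column.uncompressed_size());
            // The field this issue asks for: the codec recorded for the column chunk.
            println!("  compression: {}", column.compression());
        }
    }
    Ok(())
}

Note that the codec is recorded per column chunk, so different row groups can in principle use a different codec for the same column.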
That would be great, then I will work on it :)
Just took a look at the source code of pqrs schema, and it seems that we are using print_column_chunk_metadata() from parquet to print the metadata, so I guess I need to add a patch to parquet first.
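Roughly speaking, the patch only has to emit one more field from the column-chunk metadata. A hypothetical sketch of that shape (print_compression is a made-up helper name, not the actual parquet source):

use std::io;

use parquet::file::metadata::ColumnChunkMetaData;

// Hypothetical helper: write the codec next to the size fields the
// metadata printer already emits.
fn print_compression(out: &mut dyn io::Write, cc: &ColumnChunkMetaData) -> io::Result<()> {
    writeln!(out, "compression: {}", cc.compression())
}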
Hi @manojkarthick, would you like to get #41 merged first so that we don't need to tackle dependency compatibility problems when implementing the PR for this issue? :)
I've merged #41, let me know if you need anything else (:
I originally thought parquet#4176 could be included in release 39.0.0, but I was wrong, so we have to wait for the release of 40.0.0 :(
@SteveLauC Maybe you want to use the git revision that includes this change in Cargo.toml and raise a PR while we wait for this to get merged? It could be updated to 40.0.0 whenever that gets released.
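For reference, pinning an unreleased revision in Cargo.toml looks roughly like this (the rev value is a placeholder for whichever arrow-rs commit contains the change):

[dependencies]
# Placeholder: replace <commit-sha> with the arrow-rs commit that includes the patch.
parquet = { git = "https://github.com/apache/arrow-rs", rev = "<commit-sha>" }

Since crates.io does not accept git-only dependencies, the pin would need to be switched back to a plain version before publishing the next pqrs release.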
Hello Manoj,
Very useful tool you have built!
One feature I would like to suggest is to display which compression algorithm was used on each column. Currently, it is possible to see that compression was used from the difference between the "total compressed size" and "total uncompressed size" values, but the actual algorithm used doesn't seem to be displayed.
So the idea would be to show "GZIP", "LZO", "ZSTD", "Brotli", etc. in the schema description table.