manojkarthick / pqrs

Command line tool for inspecting Parquet files
Apache License 2.0

pqrs fails to read valid parquet file #46

Open Hoeze opened 1 year ago

Hoeze commented 1 year ago

Reading the schema works:

```bash
#> RUST_BACKTRACE=full pqrs schema example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet
Metadata for file: example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet
version: 2
num of rows: 4770
created by: Arrow2 - Native Rust implementation of Arrow
metadata:
  ARROW:schema: /////+8DAAAEAAAA8v///xQAAAAEAAEAAAAKAAsACAAKAAQA+P///wwAAAAIAAgAAAAEAAoAAACAAwAAMAMAALACAABsAgAA5AEAAKABAAAgAQAA0AAAAEgAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACwAAAGluZm9fU1ZUWVBFAOz///9wAAAAZAAAABgAAAAMAAAAEAARAAQAAAAQAAgAAAAMAAEAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACQAAAGluZm9fVFlQRQAAAPz///8EAAQACQAAAGluZm9fVFlQRQAAAOz///84AAAAIAAAABgAAAACAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD0////IAAAAAEAAAAIAAkABAAIAAgAAABpbmZvX0VORAAAAADs////bAAAAGAAAAAYAAAADAAAABAAEQAEAAAAEAAIAAAADAABAAAABAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAYAAABmaWx0ZXIAAPz///8EAAQABgAAAGZpbHRlcgAA7P///zAAAAAgAAAAGAAAAAEDAAAQABIABAAQABEACAAAAAwAAAAAAPr///8BAAYABgAEAAcAAABxdWFsaXR5AOz///9wAAAAZAAAABgAAAAMAAAAEAARAAQAAAAQAAgAAAAMAAEAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACQAAAGFsdGVybmF0ZQAAAPz///8EAAQACQAAAGFsdGVybmF0ZQAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAkAAAByZWZlcmVuY2UAAADs////aAAAAFwAAAAYAAAADAAAABAAEQAEAAAAEAAIAAAADAABAAAABAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAIAAABpZAAA/P///wQABAAKAAAAaWRlbnRpZmllcgAA7P///zgAAAAgAAAAGAAAAAIAAAAQABEABAAAABAACAAAAAwAAAAAAPT///8gAAAAAQAAAAgACQAEAAgACAAAAHBvc2l0aW9uAAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAoAAABjaHJvbW9zb21lAA==
message root {
  REQUIRED BYTE_ARRAY chromosome (STRING);
  REQUIRED INT32 position;
  REQUIRED group identifier (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY id (STRING);
    }
  }
  REQUIRED BYTE_ARRAY reference (STRING);
  REQUIRED group alternate (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY alternate (STRING);
    }
  }
  OPTIONAL FLOAT quality;
  REQUIRED group filter (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY filter (STRING);
    }
  }
  REQUIRED INT32 info_END;
  REQUIRED group info_TYPE (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY info_TYPE (STRING);
    }
  }
  REQUIRED BYTE_ARRAY info_SVTYPE (STRING);
}
```

Printing the rows with `pqrs head` does not:

```
#> RUST_BACKTRACE=full pqrs head example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("insufficient values read from column - expected: 1024, got: 0")', /data/ouga/home/ag_gagneur/hoelzlwi/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-40.0.0/src/record/reader.rs:577:36
stack backtrace:
   0: 0x55c1eab8c3a1 - std::backtrace_rs::backtrace::libunwind::trace::h6aeaf83abc038fe6
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1: 0x55c1eab8c3a1 - std::backtrace_rs::backtrace::trace_unsynchronized::h4f9875212db0ad97
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: 0x55c1eab8c3a1 - std::sys_common::backtrace::_print_fmt::h3f820027e9c39d3b
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:65:5
   3: 0x55c1eab8c3a1 - ::fmt::hded4932df41373b3
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:44:22
   4: 0x55c1eabb114f - core::fmt::rt::Argument::fmt::hc8ead7746b2406d6
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/fmt/rt.rs:138:9
   5: 0x55c1eabb114f - core::fmt::write::hb1cb56105a082ad9
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/fmt/mod.rs:1094:21
   6: 0x55c1eab8a071 - std::io::Write::write_fmt::h797fda7085c97e57
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/io/mod.rs:1713:15
   7: 0x55c1eab8c1b5 - std::sys_common::backtrace::_print::h492d3c92d7400346
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:47:5
   8: 0x55c1eab8c1b5 - std::sys_common::backtrace::print::hf74aa2eef05af215
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:34:9
   9: 0x55c1eab8d537 - std::panicking::default_hook::{{closure}}::h8cad394227ea3de8
  10: 0x55c1eab8d324 - std::panicking::default_hook::h249cc184fec99a8a
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:288:9
  11: 0x55c1eab8d9ec - std::panicking::rust_panic_with_hook::h82ebcd5d5ed2fad4
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:705:13
  12: 0x55c1eab8d8e7 - std::panicking::begin_panic_handler::{{closure}}::h810bed8ecbe66f1a
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:597:13
  13: 0x55c1eab8c7d6 - std::sys_common::backtrace::__rust_end_short_backtrace::h1410008071796261
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:151:18
  14: 0x55c1eab8d632 - rust_begin_unwind
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
  15: 0x55c1ea3efef3 - core::panicking::panic_fmt::ha0a42a25e0cf258d
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
  16: 0x55c1ea3f0393 - core::result::unwrap_failed::h100c4d67576990cf
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/result.rs:1651:5
  17: 0x55c1ea58111c - parquet::record::reader::Reader::advance_columns::he78d66a8310bbc6d
  18: 0x55c1ea581179 - parquet::record::reader::Reader::advance_columns::he78d66a8310bbc6d
  19: 0x55c1ea581971 - ::next::h612da20bf81bedfa
  20: 0x55c1ea40307f - pqrs::utils::print_rows::h9bf7a7f08e6bc5ee
  21: 0x55c1ea3f9ec3 - pqrs::commands::head::execute::h2058003142e3c2ac
  22: 0x55c1ea427b06 - pqrs::main::h38253338d29d66ac
  23: 0x55c1ea3fea3d - std::sys_common::backtrace::__rust_begin_short_backtrace::h2f1f623026f1777f
  24: 0x55c1ea41a5b8 - std::rt::lang_start::{{closure}}::hb53e3cd4c57743d8
  25: 0x55c1eab84755 - core::ops::function::impls:: for &F>::call_once::h5ce27e764c284c0a
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:284:13
  26: 0x55c1eab84755 - std::panicking::try::do_call::h4c1fc390ae241991
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:500:40
  27: 0x55c1eab84755 - std::panicking::try::h4d36e7eaed86af72
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:464:19
  28: 0x55c1eab84755 - std::panic::catch_unwind::h41cfb4dd65282b1e
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panic.rs:142:14
  29: 0x55c1eab84755 - std::rt::lang_start_internal::{{closure}}::hfed411c1c5fdb925
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs:148:48
  30: 0x55c1eab84755 - std::panicking::try::do_call::h6893f6f32a464342
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:500:40
  31: 0x55c1eab84755 - std::panicking::try::h52b7102f469a0567
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:464:19
  32: 0x55c1eab84755 - std::panic::catch_unwind::h62120054677916b5
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panic.rs:142:14
  33: 0x55c1eab84755 - std::rt::lang_start_internal::hd66bf6b7da144005
          at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs:148:20
  34: 0x55c1ea428fa5 - main
  35: 0x7ffbc5575d85 - __libc_start_main
  36: 0x55c1ea3f065e - _start
  37: 0x0 -
```

Here is the (zipped) file: clinvar_chr1_pathogenic.vcf.gz.parquet.zip

Hoeze commented 1 year ago

FYI, pandas reads the file without problems:

In [1]: import pandas as pd

In [2]: df = pd.read_parquet("example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet")

In [3]: df
Out[3]: 
     chromosome   position identifier reference alternate  quality filter   info_END info_TYPE info_SVTYPE
0             1     949523         []         C       [T]      NaN     []     949523     [SNP]            
1             1     949696         []         C      [CG]      NaN     []     949696   [INDEL]            
2             1     949739         []         G       [T]      NaN     []     949739     [SNP]            
3             1     957605         []         G       [A]      NaN     []     957605     [SNP]            
4             1     957693         []         A       [T]      NaN     []     957693     [SNP]            
...         ...        ...        ...       ...       ...      ...    ...        ...       ...         ...
4765          1  247588456         []         G       [A]      NaN     []  247588456     [SNP]            
4766          1  247588456         []         G       [C]      NaN     []  247588456     [SNP]            
4767          1  247588469         []         T       [C]      NaN     []  247588469     [SNP]            
4768          1  247588631         []         A       [G]      NaN     []  247588631     [SNP]            
4769          1  247599355         []         A       [G]      NaN     []  247599355     [SNP]            

[4770 rows x 10 columns]
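For anyone who wants to rule out a damaged download before digging into the reader, the file framing can be checked with the Python standard library alone. This is only a minimal sketch based on the Parquet file layout (4-byte `PAR1` magic at both ends, and a 4-byte little-endian footer length just before the trailing magic). The function name `check_parquet_framing` is made up for illustration and is not part of pqrs, pandas, or any Parquet library.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of a Parquet file


def check_parquet_framing(path):
    """Return the footer (metadata) length in bytes if the framing looks sane.

    Raises ValueError otherwise. This checks only the container framing,
    not the column data that `pqrs head` fails on.
    """
    size = os.path.getsize(path)
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if size < 12:
        raise ValueError("file too small to be a Parquet file")
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)  # last 8 bytes: footer length + trailing magic
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len > size - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

Calling `check_parquet_framing("example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet")` should return a positive footer length if the file is intact. Passing this check does not explain the panic; it only suggests the problem lies in how the column data is read rather than in a truncated file.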