Open Hoeze opened 1 year ago
FYI, pandas reads the file without any problems:
In [1]: import pandas as pd
In [2]: df = pd.read_parquet("example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet")
In [3]: df
Out[3]:
      chromosome   position  identifier  reference  alternate  quality  filter   info_END  info_TYPE  info_SVTYPE
0              1     949523          []          C        [T]      NaN      []     949523      [SNP]
1              1     949696          []          C       [CG]      NaN      []     949696    [INDEL]
2              1     949739          []          G        [T]      NaN      []     949739      [SNP]
3              1     957605          []          G        [A]      NaN      []     957605      [SNP]
4              1     957693          []          A        [T]      NaN      []     957693      [SNP]
...          ...        ...         ...        ...        ...      ...     ...        ...        ...          ...
4765           1  247588456          []          G        [A]      NaN      []  247588456      [SNP]
4766           1  247588456          []          G        [C]      NaN      []  247588456      [SNP]
4767           1  247588469          []          T        [C]      NaN      []  247588469      [SNP]
4768           1  247588631          []          A        [G]      NaN      []  247588631      [SNP]
4769           1  247599355          []          A        [G]      NaN      []  247599355      [SNP]
[4770 rows x 10 columns]
Reading the schema works:
But cat-ing the file does not:
Here is the (zipped) file: clinvar_chr1_pathogenic.vcf.gz.parquet.zip
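For anyone trying to narrow this down, here is a rough sketch of a framing check that needs only the Python standard library. It relies on the documented Parquet layout for unencrypted files: the magic bytes PAR1 at the start and end, with a 4-byte little-endian footer length just before the trailing magic. The function name parquet_footer_info is made up for this illustration and is not part of any library.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def parquet_footer_info(path):
    """Return (footer_length, data_length) in bytes for a Parquet file.

    Hypothetical helper: it only checks the file framing and does not
    decode the schema or any data pages.
    Raises ValueError if the framing is inconsistent.
    """
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        # Smallest possible file: leading magic + footer length + trailing magic.
        if size < 12:
            raise ValueError("file too small to be a Parquet file")

        fh.seek(0)
        head = fh.read(4)
        fh.seek(-8, os.SEEK_END)
        tail = fh.read(8)

    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")

    # The 4 bytes before the trailing magic hold the footer (metadata) length.
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len > size - 12:
        raise ValueError("footer length exceeds file size")

    # Everything between the leading magic and the footer is column data.
    data_len = size - 12 - footer_len
    return footer_len, data_len
```

Calling parquet_footer_info on the unzipped file returns the footer size and the size of the data section. If that succeeds and the schema can be read, the failure is more likely in the data pages or in how they are decoded than in the file framing, though this check alone cannot confirm that.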