SteampunkIslande closed this issue 6 months ago
Yes, great idea. In addition, it would also be great to keep the Parquet output as flat as possible: the more flat data it contains, the better.
To sum it up: simplicity is best and should always be the rule :)
Yes, I am more in the full-pipeline context. Some time ago I wrote a similar tool to convert VCF files to ndjson (using rust_htslib). I kept the same structure as the VCF in my JSON, but that structure made everything much more complicated in the downstream workflow (ETL, queries, automatic processing...). So I am now rewriting it with the option of getting an output that is as flat as possible, in addition to the VCF-structured one (and I also moved to noodles).
I understand that you want the flattest possible output, and if you find a generic way of doing it that would be great, but for now I don't see how a single function could handle all the cases at once.
Splitting FORMAT is more or less like splitting multiallelic sites; you just have to handle the additional Number=G case when you split multiallelic fields. If you split multiallelic sites in vcf2parquet, I think it makes sense to also split the multi-sample part, so that one record in your Parquet file contains one piece of information: a variant in a sample.
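To illustrate the Number=G case mentioned above, here is a minimal sketch (stdlib only, not vcf2parquet's actual code) assuming diploid genotypes. It uses the VCF specification's ordering for Number=G fields, where genotype (j, k) with j <= k sits at index k*(k+1)/2 + j, to pick the three values relevant to one alternate allele:

```rust
/// Index of genotype (j, k) in a Number=G field (diploid, j <= k),
/// following the VCF spec ordering.
fn genotype_index(j: usize, k: usize) -> usize {
    k * (k + 1) / 2 + j
}

/// Extract the three Number=G values relevant to alternate allele `a`
/// (1-based): genotypes (0,0), (0,a) and (a,a).
fn split_g_field(values: &[i32], a: usize) -> [i32; 3] {
    [
        values[genotype_index(0, 0)],
        values[genotype_index(0, a)],
        values[genotype_index(a, a)],
    ]
}

fn main() {
    // PL values for a site with two ALT alleles; the six genotypes are
    // ordered 0/0, 0/1, 1/1, 0/2, 1/2, 2/2 per the VCF spec.
    let pl = [0, 30, 50, 20, 40, 60];
    assert_eq!(split_g_field(&pl, 1), [0, 30, 50]); // record for the first ALT
    assert_eq!(split_g_field(&pl, 2), [0, 20, 60]); // record for the second ALT
    println!("{:?} {:?}", split_g_field(&pl, 1), split_g_field(&pl, 2));
}
```

The same index formula generalizes the split to any number of alternate alleles; only the ploidy assumption (diploid) is hard-coded here.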
Splitting annotations is not as easy. You need to parse the annotation description in the header for each annotation type and build a list of keys. When you parse your INFO fields, each time you detect an annotation key you split the value string, and with a zip iterator you can associate each key from the header with its annotation value. The character used for the split can differ from one annotation type to another. With expanded annotations you can directly query your Parquet file with a filter on, say, the allele frequency in the European population and some ClinVar criteria.
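A rough sketch of that key-parsing-plus-zip idea, stdlib only. The header Description format assumed here is SnpEff-style ANN (keys quoted and separated by '|'); real annotators vary, which is exactly why the separator is a parameter:

```rust
/// Extract annotation sub-field keys from a header Description string
/// such as: "Functional annotations: 'Allele | Annotation | Gene_Name'".
/// Assumes the key list is single-quoted, as in SnpEff ANN headers.
fn parse_keys(description: &str) -> Vec<String> {
    let inner = description.split('\'').nth(1).unwrap_or("");
    inner.split('|').map(|k| k.trim().to_string()).collect()
}

/// Zip the header keys with one annotation value string. The separator
/// is passed in because it differs between annotation types.
fn expand(keys: &[String], value: &str, sep: char) -> Vec<(String, String)> {
    keys.iter()
        .cloned()
        .zip(value.split(sep).map(|v| v.trim().to_string()))
        .collect()
}

fn main() {
    let desc = "Functional annotations: 'Allele | Annotation | Gene_Name'";
    let keys = parse_keys(desc);
    let pairs = expand(&keys, "A|missense_variant|BRCA1", '|');
    assert_eq!(pairs[1], ("Annotation".to_string(), "missense_variant".to_string()));
    assert_eq!(pairs[2], ("Gene_Name".to_string(), "BRCA1".to_string()));
    println!("{:?}", pairs);
}
```

Each (key, value) pair can then become its own flat Parquet column, which is what makes the allele-frequency or ClinVar filters above possible directly in the query engine.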
It would be great to split every multiallelic site into as many Parquet records, one per allele.
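As a sketch of that one-record-per-allele split (field names here are illustrative, not vcf2parquet's actual schema), each ALT allele is paired with its Number=A INFO value, e.g. AF:

```rust
/// One flat output record per alternate allele. The fields shown are a
/// hypothetical minimal schema for illustration.
#[derive(Debug, PartialEq)]
struct FlatRecord {
    pos: u32,
    reference: String,
    alt: String,
    af: f32,
}

/// Split one multiallelic site into one record per ALT allele, zipping
/// each allele with its Number=A value (one value per alternate allele).
fn split_multiallelic(pos: u32, reference: &str, alts: &[&str], af: &[f32]) -> Vec<FlatRecord> {
    alts.iter()
        .zip(af)
        .map(|(alt, af)| FlatRecord {
            pos,
            reference: reference.to_string(),
            alt: alt.to_string(),
            af: *af,
        })
        .collect()
}

fn main() {
    // A site A -> T,G becomes two flat records, each carrying its own AF.
    let records = split_multiallelic(12_345, "A", &["T", "G"], &[0.12, 0.03]);
    assert_eq!(records.len(), 2);
    assert_eq!(records[1].alt, "G");
    println!("{:?}", records);
}
```

Number=R fields would need one extra step (keeping the REF value alongside each allele's value), and Number=G fields the genotype-index arithmetic discussed earlier, but the per-allele zip is the common core.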
Already on it :)