natir / vcf2parquet

Convert vcf in parquet
MIT License

Split record by allele #10

Closed (SteampunkIslande closed 6 months ago)

SteampunkIslande commented 7 months ago

It would be great to split every multiallelic site into several Parquet records, one for each allele.

Already on it :)
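To illustrate the request, here is a minimal sketch of splitting by allele. It is not the vcf2parquet implementation: the `record` dict layout and the `split_by_allele` name are hypothetical. The only real assumption is that each INFO value comes with its header `Number` (`A` for one value per ALT allele, `R` for one value per allele including REF).

```python
def split_by_allele(record):
    """Yield one flat row per ALT allele of a multiallelic record.

    `record` is a hypothetical dict, not the vcf2parquet data model:
    {"chrom", "pos", "ref", "alts": [...], "info": {key: (number, value)}}
    where `number` is the VCF header Number ("A", "R", "1", ...).
    """
    for i, alt in enumerate(record["alts"]):
        row = {
            "chrom": record["chrom"],
            "pos": record["pos"],
            "ref": record["ref"],
            "alt": alt,
        }
        for key, (number, value) in record["info"].items():
            if number == "A":
                # One value per ALT allele: keep the i-th one.
                row[key] = value[i]
            elif number == "R":
                # One value per allele including REF: keep REF and the i-th ALT.
                row[key] = [value[0], value[i + 1]]
            else:
                # Site-level value: copy it unchanged to every row.
                row[key] = value
        yield row


# Made-up example record with two ALT alleles.
record = {
    "chrom": "1", "pos": 1000, "ref": "A", "alts": ["C", "T"],
    "info": {"AF": ("A", [0.1, 0.2]), "AD": ("R", [10, 3, 5]), "DP": ("1", 18)},
}
rows = list(split_by_allele(record))
```

Each resulting row describes exactly one REF/ALT pair, so it maps directly onto one Parquet record with scalar columns.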

GhisF commented 7 months ago

Yes, great idea. In addition, it would also be great to:

The flatter the data in the Parquet output, the better.

SteampunkIslande commented 7 months ago

To sum it up: Simplicity is the best and should always be the rule :)

GhisF commented 7 months ago

Yes, I am thinking more in the context of a full pipeline. Some time ago I wrote a similar tool to convert VCF files to ndjson (using rust_htslib). I kept the same structure as the VCF in my JSON, but this structure made everything much more complicated in the downstream workflow (ETL, queries, automatic processing...). So I am now rewriting it with the option to produce an output that is as flat as possible, in addition to the VCF-structured one (and I also moved to noodles).

SteampunkIslande commented 7 months ago

I understand that you want the flattest possible output, and if you find a generic way of doing it, that would be great, but for now I don't see how a single function could handle all the cases at once.

GhisF commented 7 months ago

Splitting FORMAT is more or less like splitting the multiallelic sites; you just have to handle the additional Number=G case when you split multiallelic fields. If you parse multiallelic sites in vcf2parquet, I think it makes sense to also parse the multi-sample part, so that in your Parquet file one record contains one piece of information: a variant in a sample.
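A rough sketch of the Number=G case, assuming diploid genotypes: the VCF specification orders the genotype j/k (with j <= k) at index k*(k+1)/2 + j, so for each ALT allele `a` only the entries for 0/0, 0/a and a/a are kept. The `samples` layout and the function names are hypothetical and only serve the illustration.

```python
def genotype_index(j, k):
    """Position of diploid genotype j/k (j <= k) in a Number=G field (VCF spec ordering)."""
    return k * (k + 1) // 2 + j


def split_samples(alts, samples):
    """Yield one row per (ALT allele, sample) pair.

    `samples` is a hypothetical mapping {sample_name: {"PL": [...]}} where PL
    is a diploid Number=G field; the layout is an assumption for illustration.
    """
    for a, alt in enumerate(alts, start=1):
        # Genotypes that only involve REF (0) and the current ALT (a).
        keep = [genotype_index(0, 0), genotype_index(0, a), genotype_index(a, a)]
        for name, fields in samples.items():
            yield {"alt": alt, "sample": name, "PL": [fields["PL"][i] for i in keep]}


# Two ALT alleles give 6 diploid genotypes: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
samples = {"S1": {"PL": [0, 10, 20, 30, 40, 50]}, "S2": {"PL": [5, 0, 15, 25, 35, 45]}}
rows = list(split_samples(["C", "T"], samples))
```

With two ALT alleles and two samples this yields four rows, each holding one variant in one sample with a three-value PL.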

Splitting annotations is not as easy. You need to parse the annotation description in the header for each annotation type and build a list of keys. When you parse your INFO fields, each time you detect an annotation key, you split the value string, and with a zip iterator you associate each key from the header with its annotation value. The character used for the split can differ from one annotation type to another. With expanded annotations, you can query your Parquet file directly, for example with a filter on the allele frequency in the European population plus some ClinVar criteria.
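The header-then-zip approach described above could look roughly like this. It is a sketch under assumptions: the description string, the `Format: ` marker, the `|` and `,` separators and the sub-field names (including `AF_EUR`) are made up for the example, and real annotation tools each use their own layout.

```python
def annotation_keys(description, marker="Format: ", sep="|"):
    """Extract sub-field names from an INFO Description string."""
    return [k.strip() for k in description.split(marker, 1)[1].split(sep)]


def expand_annotation(value, keys, sep="|", entry_sep=","):
    """Turn one annotation INFO value into a list of {key: value} dicts."""
    # zip pairs each header key with the value at the same position.
    return [dict(zip(keys, entry.split(sep))) for entry in value.split(entry_sep)]


# Hypothetical header description and INFO value, for illustration only.
description = "Consequence annotations. Format: Allele|Consequence|SYMBOL|AF_EUR"
keys = annotation_keys(description)
entries = expand_annotation(
    "C|missense_variant|BRCA2|0.001,T|synonymous_variant|BRCA2|", keys
)
```

Passing `marker`, `sep` and `entry_sep` as parameters reflects the point that the split character changes between annotation types. Empty sub-fields come back as empty strings and would still need converting to nulls or typed values before being written as Parquet columns.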