SteampunkIslande closed this issue 6 months ago
Yes, great idea. In addition, it would also be great to keep the Parquet output as flat as possible: the more flat data it contains, the better.
To sum it up: simplicity is best and should always be the rule :)
Yes, I am more in the full-pipeline context. Some time ago I wrote a similar tool to convert VCF files to ndjson (using rust_htslib). I kept the same structure as the VCF in my JSON, but that structure made everything much more complicated in the downstream workflow (ETL, queries, automatic processing...). So I am now rewriting it with the option of getting an output that is as flat as possible, in addition to the VCF-structured one (and I also moved to noodles).
I understand that you want the flattest possible output, and if you find a generic way of doing it that would be great, but for now I don't see how a single function could handle all the cases at once.
Splitting FORMAT is more or less like splitting multiallelic sites; you just have to handle the additional Number=G case when you split multiallelic fields. If you split multiallelic sites in vcf2parquet, I think it makes sense to also split the multi-sample part, so that one record in your Parquet file contains one piece of information: a variant in a sample.
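To illustrate the Number=G case mentioned above, here is a minimal sketch (stdlib only, not vcf2parquet's actual code) assuming diploid genotypes. It uses the VCF specification's ordering for Number=G fields, where genotype (j, k) with j <= k sits at index k*(k+1)/2 + j, to pick the three values relevant to one alternate allele:

```rust
/// Index of genotype (j, k) in a Number=G field (diploid, j <= k),
/// following the VCF spec ordering.
fn genotype_index(j: usize, k: usize) -> usize {
    k * (k + 1) / 2 + j
}

/// Extract the three Number=G values relevant to alternate allele `a`
/// (1-based): genotypes (0,0), (0,a) and (a,a).
fn split_g_field(values: &[i32], a: usize) -> [i32; 3] {
    [
        values[genotype_index(0, 0)],
        values[genotype_index(0, a)],
        values[genotype_index(a, a)],
    ]
}

fn main() {
    // PL values for a site with two ALT alleles; the six genotypes are
    // ordered 0/0, 0/1, 1/1, 0/2, 1/2, 2/2 per the VCF spec.
    let pl = [0, 30, 50, 20, 40, 60];
    assert_eq!(split_g_field(&pl, 1), [0, 30, 50]); // record for the first ALT
    assert_eq!(split_g_field(&pl, 2), [0, 20, 60]); // record for the second ALT
    println!("{:?} {:?}", split_g_field(&pl, 1), split_g_field(&pl, 2));
}
```

The same index formula generalizes the split to any number of alternate alleles; only the ploidy assumption (diploid) is hard-coded here.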
Splitting annotations is not as easy. You need to parse the annotation description in the header for each annotation type and build a list of keys. When you parse your INFO fields, each time you detect an annotation key you split the value string, and with a zip iterator you can associate each key from the header with its annotation value. The character used for the split can differ from one annotation type to another. With expanded annotations you can directly query your Parquet file with a filter on, say, the allele frequency in the European population and some ClinVar criteria.
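A rough sketch of that key-parsing-plus-zip idea, stdlib only. The header Description format assumed here is SnpEff-style ANN (keys quoted and separated by '|'); real annotators vary, which is exactly why the separator is a parameter:

```rust
/// Extract annotation sub-field keys from a header Description string
/// such as: "Functional annotations: 'Allele | Annotation | Gene_Name'".
/// Assumes the key list is single-quoted, as in SnpEff ANN headers.
fn parse_keys(description: &str) -> Vec<String> {
    let inner = description.split('\'').nth(1).unwrap_or("");
    inner.split('|').map(|k| k.trim().to_string()).collect()
}

/// Zip the header keys with one annotation value string. The separator
/// is passed in because it differs between annotation types.
fn expand(keys: &[String], value: &str, sep: char) -> Vec<(String, String)> {
    keys.iter()
        .cloned()
        .zip(value.split(sep).map(|v| v.trim().to_string()))
        .collect()
}

fn main() {
    let desc = "Functional annotations: 'Allele | Annotation | Gene_Name'";
    let keys = parse_keys(desc);
    let pairs = expand(&keys, "A|missense_variant|BRCA1", '|');
    assert_eq!(pairs[1], ("Annotation".to_string(), "missense_variant".to_string()));
    assert_eq!(pairs[2], ("Gene_Name".to_string(), "BRCA1".to_string()));
    println!("{:?}", pairs);
}
```

Each (key, value) pair can then become its own flat Parquet column, which is what makes the allele-frequency or ClinVar filters above possible directly in the query engine.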
It would be great to split every multiallelic site into as many Parquet records, one per allele.
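As a sketch of that one-record-per-allele split (field names here are illustrative, not vcf2parquet's actual schema), each ALT allele is paired with its Number=A INFO value, e.g. AF:

```rust
/// One flat output record per alternate allele. The fields shown are a
/// hypothetical minimal schema for illustration.
#[derive(Debug, PartialEq)]
struct FlatRecord {
    pos: u32,
    reference: String,
    alt: String,
    af: f32,
}

/// Split one multiallelic site into one record per ALT allele, zipping
/// each allele with its Number=A value (one value per alternate allele).
fn split_multiallelic(pos: u32, reference: &str, alts: &[&str], af: &[f32]) -> Vec<FlatRecord> {
    alts.iter()
        .zip(af)
        .map(|(alt, af)| FlatRecord {
            pos,
            reference: reference.to_string(),
            alt: alt.to_string(),
            af: *af,
        })
        .collect()
}

fn main() {
    // A site A -> T,G becomes two flat records, each carrying its own AF.
    let records = split_multiallelic(12_345, "A", &["T", "G"], &[0.12, 0.03]);
    assert_eq!(records.len(), 2);
    assert_eq!(records[1].alt, "G");
    println!("{:?}", records);
}
```

Number=R fields would need one extra step (keeping the REF value alongside each allele's value), and Number=G fields the genotype-index arithmetic discussed earlier, but the per-allele zip is the common core.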
Already on it :)