varfish-org / mehari

VEP-like tool for sequence ontology and HGVS annotation of VCF files
MIT License

Support multi-allelic VCF #424

Closed xiamaz closed 4 months ago

xiamaz commented 6 months ago

Is your feature request related to a problem? Please describe.
Currently, VCF files containing multi-allelic sites must be decomposed before mehari can annotate them, whereas VEP accepts them directly. This makes direct benchmark comparisons between the two tools harder.

Describe the solution you'd like
mehari should accept multi-allelic records and simply decompose them while processing. Since the VCF parsing library it uses already fully supports parsing multi-allelic records, only changes in mehari itself should be necessary.
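To illustrate what "decompose while processing" means, here is a minimal sketch in Python (not mehari's Rust code). The function names `decompose` and `decompose_line` are hypothetical. It only splits the ALT column and assumes that per-allele INFO/FORMAT fields and genotypes, which a real implementation would also have to split or remap, can be ignored.

```python
def decompose(chrom, pos, ref, alts):
    """Yield one biallelic (chrom, pos, ref, alt) tuple per ALT allele."""
    for alt in alts:
        yield (chrom, pos, ref, alt)


def decompose_line(line):
    """Split one tab-separated VCF data line into biallelic site tuples.

    Assumption: only CHROM, POS, REF and ALT are carried over; INFO,
    FORMAT and sample columns are dropped in this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    return list(decompose(chrom, pos, ref, alt.split(",")))


sites = decompose_line("chr1\t100\t.\tA\tC,G\t.\tPASS\t.")
# sites == [("chr1", 100, "A", "C"), ("chr1", 100, "A", "G")]
```

A record that is already biallelic passes through as a single tuple, so the extra work for normalized input is one string split per line.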

Describe alternatives you've considered
Otherwise, the input has to be preprocessed with bcftools, e.g. bcftools norm -m- -a. Because mehari also cannot read directly from stdin, the decomposed file always has to be written to disk first. This has a significant impact on overall performance for non-normalized VCFs.
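For reference, the current workaround can be scripted roughly as follows. This is only a sketch: the helper name `build_norm_command` and the file names are made up, and the mehari invocation is left as a comment because its exact arguments are not given here. The bcftools flags are the ones quoted above, plus `-o` for the output file.

```python
import subprocess


def build_norm_command(input_vcf, output_vcf):
    """Build the bcftools call that splits (-m-) and atomizes (-a) records."""
    return ["bcftools", "norm", "-m-", "-a", "-o", output_vcf, input_vcf]


def preprocess(input_vcf, output_vcf):
    """Write the decomposed VCF to disk; requires bcftools on PATH."""
    subprocess.run(build_norm_command(input_vcf, output_vcf), check=True)
    # mehari would then be run on output_vcf, since it cannot read stdin.


cmd = build_norm_command("input.vcf", "decomposed.vcf")
```

The intermediate file is exactly the extra disk write described above.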

Additional context
If possible, the performance penalty for already-normalized VCFs should be kept close to zero. This needs to be measured when making changes.

holtgrewe commented 6 months ago

This will need a reference FASTA and an implementation of variant normalization (https://academic.oup.com/bioinformatics/article/31/13/2202/196142). Because normalization can shift variant positions, it will also need local re-sorting of variants, potentially line-wise external-memory sorting.
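A rough Python sketch of both steps, to make the requirement concrete. `normalize` follows the left-align-and-trim idea from the linked paper for a single biallelic variant; `resort_local` is a bounded in-memory buffer, not the external-memory sort mentioned above. Both function names, the plain-string reference, and the `window` parameter are assumptions for illustration. Symbolic alleles such as `<DEL>` are not handled.

```python
import heapq


def normalize(pos, ref, alt, ref_seq):
    """Left-align and trim one biallelic variant.

    pos is 1-based; ref_seq is the chromosome sequence as a plain string.
    """
    if ref == alt:
        return pos, ref, alt  # not a variant, nothing to normalize
    changed = True
    while changed:
        changed = False
        # Drop a shared rightmost base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # An allele became empty: extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot left-extend at the sequence start")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leftmost bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def resort_local(records, window):
    """Re-sort a nearly sorted stream of (pos, ref, alt) tuples.

    Assumption: no record lies more than `window` bases to the left of
    any record that came before it in the stream.
    """
    heap, max_seen = [], 0
    for rec in records:
        heapq.heappush(heap, rec)
        max_seen = max(max_seen, rec[0])
        # Records this far behind can no longer be overtaken.
        while heap and heap[0][0] < max_seen - window:
            yield heapq.heappop(heap)
    while heap:
        yield heapq.heappop(heap)


ref_seq = "GGGCACACACAGGG"
variants = [(5, "A", "G"), (8, "CAC", "C")]  # sorted by position
normalized = [normalize(p, r, a, ref_seq) for p, r, a in variants]
# normalized == [(5, "A", "G"), (3, "GCA", "G")], no longer sorted
result = list(resort_local(normalized, window=10))
# result == [(3, "GCA", "G"), (5, "A", "G")]
```

The deletion inside the CA repeat moves from position 8 to position 3, which is why the output order has to be repaired. How far a variant can shift depends on the repeat length, so a fixed window is only safe with a known bound; otherwise an external sort would be needed.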

xiamaz commented 4 months ago

Closing for #447