Closed xiamaz closed 4 months ago
This will need a reference FASTA and an implementation of variant normalization (https://academic.oup.com/bioinformatics/article/31/13/2202/196142). Because normalization can shift variant positions, it will also need local re-sorting of variants, potentially via line-wise external-memory sorting.
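For illustration, the normalization step described in the linked paper (right-trim shared bases, extend left from the reference when an allele becomes empty, then left-trim shared bases) could look roughly like the sketch below. This is not mehari code: the function name `normalize_variant` is hypothetical, the reference is assumed to be a single chromosome sequence already loaded into memory as a string, and only bi-allelic records are handled.

```python
def normalize_variant(pos, ref, alt, reference):
    """Left-align and trim a bi-allelic variant.

    pos is 1-based (as in VCF); reference is the chromosome sequence as a
    plain string (an assumption of this sketch, a real implementation would
    query an indexed FASTA).
    """
    if ref == alt:
        # Not a variant; nothing sensible to normalize.
        return pos, ref, alt

    alleles = [ref, alt]
    changed = True
    while changed:
        changed = False
        # Drop the rightmost base while all alleles end in the same base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, pad all alleles with the preceding
        # reference base and shift the position one to the left.
        if any(len(a) == 0 for a in alleles):
            if pos <= 1:
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            base = reference[pos - 1]
            alleles = [base + a for a in alleles]
            changed = True

    # Drop shared leading bases as long as every allele keeps at least one.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1

    return pos, alleles[0], alleles[1]
```

Since the returned position can be smaller than the input position, records emitted in input order may no longer be sorted, which is why the re-sorting mentioned above would be needed.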
Closing for #447
Is your feature request related to a problem? Please describe.
Currently, VCFs containing multi-allelic sites need to be decomposed before use, whereas vep supports them directly. This limitation makes direct benchmark comparisons more difficult.
Describe the solution you'd like
mehari should support multi-allelic sites and simply decompose them while processing. As the VCF parsing library in use already fully supports parsing multi-allelic sites, only changes in mehari should be necessary.
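As a rough sketch of what decomposing while processing could mean, the snippet below splits one VCF data line into one line per ALT allele. It is an illustration only and not how mehari or its parsing library work: `decompose_line` is a hypothetical name, and all other columns are copied verbatim, which is only valid for sites-only input without per-allele INFO fields (a real implementation would also have to split `Number=A/R/G` fields and rewrite genotypes). Lines without a comma in ALT are returned unchanged, so already-decomposed input takes a cheap fast path.

```python
def decompose_line(line):
    """Yield one bi-allelic VCF data line per ALT allele of `line`.

    Assumption of this sketch: tab-separated, sites-only records; INFO and
    any further columns are copied as-is and NOT split per allele.
    """
    fields = line.rstrip("\n").split("\t")
    alt_field = fields[4]  # columns: CHROM POS ID REF ALT QUAL FILTER INFO

    # Fast path: bi-allelic records pass through untouched.
    if "," not in alt_field:
        yield line
        return

    for alt in alt_field.split(","):
        out = list(fields)
        out[4] = alt
        yield "\t".join(out)


if __name__ == "__main__":
    record = "1\t100\t.\tA\tC,T\t.\t.\t."
    for decomposed in decompose_line(record):
        print(decomposed)
```

After splitting, REF and ALT may share redundant bases (for example REF `CAC` with ALT `C,CAT`), so the resulting records might still need trimming or normalization.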
Describe alternatives you've considered
Otherwise, preprocessing with bcftools, e.g. bcftools norm -m- -a, is necessary. As mehari also does not support reading directly from stdin, the preprocessed file always has to be written to disk. This has a significant impact on overall performance for non-normalized VCFs.
Additional context
If possible, the performance penalty for already-normalized VCFs should be kept close to zero. This needs to be measured when making changes.