samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

VCF format: correct representation of complex indels and MNPs #774

Closed marc-sturm closed 3 months ago

marc-sturm commented 3 months ago

Hi,

I have a question about the correct representation of complex indels, i.e. insertion and deletion at the same position, and MNPs in VCF format.

Do complex indels need to be prefixed with the reference base or not? Here is an example, a GRCh38 variant listed in ClinVar:

1 930222 . GAACTC TTCTTCTG

According to my interpretation of the standard, this variant should be written as:

1 930221 . CGAACTC CTTCTTCTG

Similarly, for multi-nucleotide polymorphisms, which would be correct? This:

1 930222 . GA TT

or this:

1 930221 . CGA CTT

Thanks, Marc

d-cameron commented 3 months ago

Both notations are valid and there is no 'correct' representation of these variants in VCF. From a VCF perspective, the only difference is the longer representation includes the starting C in the allele whereas the shorter version makes no claim on what the base is at position 930221.
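To make the equivalence concrete, here is a minimal Python sketch that strips the bases REF and ALT share at the start and advances POS accordingly. The function name `strip_shared_prefix` is made up for illustration and is not part of any VCF library; the coordinates and alleles are the ones from the question.

```python
def strip_shared_prefix(pos, ref, alt):
    """Drop leading bases shared by REF and ALT, advancing POS.

    At least one base is kept in each allele, so simple insertions and
    deletions keep their anchoring base.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Complex indel from the question: long form vs. short form.
print(strip_shared_prefix(930221, "CGAACTC", "CTTCTTCTG"))
# (930222, 'GAACTC', 'TTCTTCTG')

# MNP from the question.
print(strip_shared_prefix(930221, "CGA", "CTT"))
# (930222, 'GA', 'TT')
```

Both long forms reduce to the short forms, which is the sense in which the two notations describe the same change; the long form additionally asserts that position 930221 is a C.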

As there are multiple competing approaches to variant normalisation, more recent formats such as VRS explicitly include variant normalisation (https://vrs.ga4gh.org/en/stable/impl-guide/normalization.html) in their specifications. VCF does not.

pd3 commented 3 months ago

Even though the specification does not require it, VCFs are routinely normalized. The consensus is to use a parsimonious, left-aligned representation. See for example https://genome.sph.umich.edu/wiki/Variant_Normalization.
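As a rough illustration of what "parsimonious, left-aligned" means in practice, the sketch below trims shared trailing bases, extends to the left when an allele becomes empty, and finally trims shared leading bases. It is a simplified single-ALT version written for this thread, not the code of any particular tool; `normalize`, `ref_seq` and the toy reference sequence are assumptions made for the example.

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant.

    ref_seq: the chromosome sequence as a plain string (toy input here)
    pos:     1-based position of the first base of REF
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical; nothing to normalize")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pull in one reference base on the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of position 1")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


toy_ref = "GGGCACACACAGGG"  # made-up sequence with a CA repeat
# A CA deletion written at the right end of the repeat ...
print(normalize(toy_ref, 9, "ACA", "A"))
# (3, 'GCA', 'G')  ... moves to the left end of the repeat.
```

Under this scheme the complex indel from the question ends up in its short form (930222, GAACTC, TTCTTCTG), because the shared leading C is trimmed and nothing can be shifted.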

marc-sturm commented 3 months ago

Thanks for the reply. It is inconvenient that the specification does not state exactly how to handle these types of variants. For simple deletions and insertions it is clearly stated that the prefix reference base is required. I think that should be the case for all variants except SNVs. Different representations of the same variant should be avoided when possible, imo.

@pd3 yes, VCF should be normalized to make variants comparable - we even have a tool for it: https://github.com/imgag/ngs-bits/blob/master/doc/tools/VcfLeftNormalize.md

pd3 commented 3 months ago

Yeah, there are many tools for that. I wrote one myself: http://samtools.github.io/bcftools/bcftools.html#norm

d-cameron commented 3 months ago

> The consensus is to use a parsimonious, left-aligned representation.

Note that this consensus is not universal. For example, HGVS uses 3' (right) normalisation when defining variant coordinates. The widespread usage of both left and right normalisation schemes is, in part, why the VCF specifications themselves have remained silent on any sort of variant normalisation.
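A small Python sketch of why the two conventions give different coordinates for the same event: a deletion inside a repeat can slide in either direction without changing the resulting sequence. The helper `shift_deletion` and the toy reference are invented for this illustration and do not come from HGVS or VCF tooling.

```python
def shift_deletion(ref_seq, start, end, direction):
    """Slide a deletion of ref_seq[start..end] (1-based, inclusive) as far
    as possible to the "left" or "right" without changing the outcome."""
    if direction == "right":
        # The deletion can move right while the first deleted base equals
        # the base just after the deleted interval.
        while end < len(ref_seq) and ref_seq[start - 1] == ref_seq[end]:
            start, end = start + 1, end + 1
    else:
        # The deletion can move left while the base just before the interval
        # equals the last deleted base.
        while start > 1 and ref_seq[start - 2] == ref_seq[end - 1]:
            start, end = start - 1, end - 1
    return start, end


toy_ref = "GGGCACACACAGGG"  # made-up sequence with a CA repeat
print(shift_deletion(toy_ref, 6, 7, "left"))   # (4, 5)
print(shift_deletion(toy_ref, 6, 7, "right"))  # (10, 11)
```

Deleting positions 4-5 or 10-11 of the toy sequence yields the same result, so a left-aligning tool and a 3'-aligning tool report different positions for one and the same deletion.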