Closed marc-sturm closed 3 months ago
Both notations are valid and there is no 'correct' representation of these variants in VCF. From a VCF perspective, the only difference is the longer representation includes the starting C in the allele whereas the shorter version makes no claim on what the base is at position 930221.
As there are multiple competing approaches to variant normalisation, more recent formats such as VRS explicitly include variant normalisation (https://vrs.ga4gh.org/en/stable/impl-guide/normalization.html) in their specifications. VCF does not.
Even though the specification does not require it, VCFs are routinely normalized. The consensus is to use a parsimonious, left-aligned representation. See for example https://genome.sph.umich.edu/wiki/Variant_Normalization.
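For illustration, here is a minimal Python sketch of the parsimonious, left-aligned normalisation described on that page, applied to one biallelic record. It is not the code of any particular tool: the `normalize` function, the `fetch_base` callback, and the toy reference `TOY_REF` are assumptions made up for this example. A real implementation would read bases from an indexed FASTA and handle multi-allelic records.

```python
from typing import Callable, Tuple

def normalize(pos: int, ref: str, alt: str,
              fetch_base: Callable[[int], str]) -> Tuple[int, str, str]:
    """Left-align and trim one biallelic variant (1-based position).

    fetch_base(p) is a hypothetical callback returning the reference
    base at 1-based position p; it is only called when an allele
    becomes empty and has to be extended to the left.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        # Not a variant; return unchanged so the loop below cannot run forever.
        return pos, ref, alt

    # Step 1: drop shared trailing bases, extending left when an allele empties.
    while ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
        if not ref or not alt:
            pos -= 1
            base = fetch_base(pos).upper()
            ref, alt = base + ref, base + alt

    # Step 2: drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt


# Toy reference (an assumption for this example, not real genome sequence).
TOY_REF = "GCACACAT"

def toy_fetch(p: int) -> str:
    """Return the base of TOY_REF at 1-based position p."""
    return TOY_REF[p - 1]

def no_reference(p: int) -> str:
    """Stand-in used where no left extension is expected."""
    raise LookupError(f"reference base at {p} was requested unexpectedly")


# The two spellings of the complex indel from this thread converge:
print(normalize(930221, "CGAACTC", "CTTCTTCTG", no_reference))
print(normalize(930222, "GAACTC", "TTCTTCTG", no_reference))
# The padded MNP loses its leading C as well:
print(normalize(930221, "CGA", "CTT", no_reference))
# A deletion inside a CA repeat is shifted to its leftmost position:
print(normalize(5, "ACA", "A", toy_fetch))
```

Under this scheme the padded and unpadded records from the question collapse to the same unpadded form, because the shared leading C is trimmed while both alleles still have more than one base. The padding base survives only for pure insertions and deletions, where one allele would otherwise be empty.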
Thanks for the reply. It is inconvenient that the specification does not say exactly how to handle these types of variants. For simple deletions and insertions it is clearly stated that the prefix reference base is required. I think that should be the case for all variants except SNVs. Different representations of the same variant should be avoided when possible, imo.
@pd3 yes, VCF should be normalized to make variants comparable - we even have a tool for it: https://github.com/imgag/ngs-bits/blob/master/doc/tools/VcfLeftNormalize.md
Yeah, there are many tools for that. I wrote one myself: http://samtools.github.io/bcftools/bcftools.html#norm
> The consensus is to use a parsimonious, left-aligned representation.
Note that this consensus is not universal. For example, HGVS uses 3' (right) normalisation when defining variant coordinates. The widespread usage of both left and right normalisation schemes is, in part, why the VCF specifications themselves have remained silent on any sort of variant normalisation.
Hi,
I have a question about the correct representation of complex indels, i.e. insertion and deletion at the same position, and MNPs in VCF format.
Do complex indels need to be prefixed with the reference base or not? Here is an example of a GRCh38 variant listed in ClinVar:
1 930222 . GAACTC TTCTTCTG
According to my interpretation of the standard, this variant should be written as:
1 930221 . CGAACTC CTTCTTCTG
Similarly, for multi-nucleotide polymorphisms, would this be correct:
1 930222 . GA TT
or this:
1 930221 . CGA CTT
Thanks, Marc