samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
632 stars 174 forks source link

VCF4.4 SVLEN requirement across different variant representations #769

Open tcezard opened 2 months ago

tcezard commented 2 months ago

I would like some confirmation on the definition of SVLEN when used with different representations of specific variants. I used Tandem Repeats as an example but that could be applicable to others.

Example bellow

##fileformat=VCFv4.4
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the longest variant described in this record">
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
##INFO=<ID=CN,Number=A,Type=Float,Description="Copy number of allele">
##INFO=<ID=RN,Number=A,Type=Integer,Description="Total number of repeat sequences in this allele">
##INFO=<ID=RUS,Number=.,Type=String,Description="Repeat unit sequence of the corresponding repeat sequence">
##INFO=<ID=RUL,Number=.,Type=Integer,Description="Repeat unit length of the corresponding repeat sequence">
##INFO=<ID=RUC,Number=.,Type=Float,Description="Repeat unit count of corresponding repeat sequence">
##INFO=<ID=RB,Number=.,Type=Integer,Description="Total number of bases in the corresponding repeat sequence">
##INFO=<ID=RUB,Number=.,Type=Integer,Description="Number of bases in each individual repeat unit">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set">
##ALT=<ID=CNV:TR,Description="Tandem repeat determined based on DNA abundance">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
chr1 130 . G GCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG . . GT:PS 1|0:100
chr1 130 . G <CNV:TR> END=130;SVLEN=1;CN=20;RUS=CAG;RN=1;RB=60 . GT:PS 1|0:100
chr1 130 . G <INS> END=130;SVLEN=60; . GT:PS 1|0:100

Did I interpret the specs correctly ?

I see two definitions of SVLEN: Length of the Structural Variant or length of the reference allele and I find this confusing. (I'm expecting I won't be the only one). Was this intended ?

Is it necessary to enforce absence of SVLEN value for non symbolic allele ?

d-cameron commented 2 months ago

END and SVLEN represent the same information. The new VCFv4.5 definition reflects this with END officially deprecated and redefined as computed field based on what's in SVLEN. Think of it in terms of END=POS+SVLEN (except for <INS>).

I see two definitions of SVLEN: Length of the Structural Variant or length of the reference allele and I find this confusing. (I'm expecting I won't be the only one). Was this intended?

Yes. There are indeed two definitions:

For \, [POS+1, POS+SVLEN] is deleted For \, [POS+1, POS+SVLEN] is tandemly duplication For \, [POS+1, POS+SVLEN] is inverted For \, [POS+1, POS+SVLEN] has a copy number of whatever's in the INFO CN field. For \, there are SVLEN additional bases between POS and POS+1

We could have split it up into SVLEN defining END (thus always SVLEN=1 for \) and had a separate field (e.g. INSLEN) to define the number of extra additional bases but that didn't happen so we're stuck being as backwards compatible as is reasonable^.

Example bellow Did I interpret the specs correctly ?

Almost. Your example is a bit unusual in that you're defining a novel TR insertion. Generally speaking TR callers report expansion/contraction of existing TRs and the <CNV:TR> records are defined over the length of the TR in the reference (see example in Section 5.7). For , SVLEN is just defining how long that TR is in the reference (i.e. what's in the TR catalogue). The actual expansion/contraction is defined in the RUS/RUL/RB/RUC/RUB INFO fields. SVLEN is just there so the CNV END is defined and CN is only there so it's still a valid \ record. Where your example is wrong is that you're defining a 60bp insertion over a 1bp long copy number interval so you have CN=61 copies of this interval, not CN=20. The CN field is defined for all <CNV> records. Yes this is a bit weird but it done this way so <CNV> parsers can handle <CNV:TR> records without needing to know anything about the R* fields.

The other issues is you've done is defined both a sequence allele and a symbolic \ allele for the same variant. Copy number records are in their own category so you can write a <CNV:TR> DNA abundance record as well as a direct sequence record without issue but the sequence record and <INS> will the treated as two records thus you're saying there's 120bp inserted just after POS 130 (in an unknown order of insertion. Use PSO to disambiguate this). If you're just demonstrating that you can write the same 60bp insertion multiple ways then yes, sequence allele and \ are fine as ALT alleles.

Is it necessary to enforce absence of SVLEN value for non symbolic allele ?

Technically it's not enforced - the wording is "should" not "must". If a implementation-defined field has a meaningful implementation-defined interpretation of SVLEN then the specs will allow it. If you've got a file that has SVLEN defined then it's not an invalid VCF, it's just not following the recommendations (There's draft 'SAM/VCF strict' specs designed as a set of validation rules to highlight issues with technically-compliant SAM/VCF file but there's still sitting as a PR as nobody's writing a validator that would use them.

^ Up to VCFv4.3 SVLEN was defined as the length difference between REF and ALT so was meaningless for \ and in practice every caller that reported \ used the VCFv4.4 redefinition as the length of the SV anyway.