samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
655 stars 173 forks source link

Ambiguous interpretation of POS when using symbolic alleles #173

Closed ctsa closed 2 years ago

ctsa commented 8 years ago

There appear to be conflicting interpretations for the span of a VCF record using a symbolic allele. According to the following definition (in the REF field description), the span is [POS+1,END]:

https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L314-L317

"If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String ``$<$ID$>$'') then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism"

The above definition conflicts with the defacto definition of POS when using the symbolic <*> allele in gVCF, which suggests that a nonvariant gVCF block record span is [POS, END].

https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L1219

The "preceding base" convention is also not followed in various SV examples in the current spec where the intended span seems to be [POS,END]:

https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L992-L1001

https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L1058-L1069

Is there room to clarify the definition currently detailed in REF?

adamnovak commented 6 years ago

I'm working on the symbolic structural variant input in vgteam/vg, and I need to answer this question too.

Right now, the semantics I am working from, which I am implementing in https://github.com/vgteam/vg/pull/1929 are:

The current spec version says:

If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String “<ID>”) then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.

This would imply that my handling for <INS> and <DEL> is correct, but my <INV> handling is wrong, because that first base at POS should be anchoring and not part of the inversion. This would imply that converting an inversion from a multi-nucelotide substitution with a sequence in the ALT field to a symbolic representation would change its POS.

But there is also still this example inversion description in the breakends section:

http://i.imgur.com/vubBiCY.png

That clearly shows an inversion with a POS of 321682 describing an <INV> inversion where the first inverted base is base 321682.

Which is the correct semantics to implement for reading existing VCF files that use <INV> notation to describe inversions? I'm now leading towards the interpretation where you have to have a left anchoring base, even for <INV>, because that's what the spec literally says; the example might just be wrong. The behavior would be undefined when you want to have a symbolic variant up against the left end of a contig, and I'd have to rewrite my implementation, but it makes more sense to say that the spec has an incorrect example than that the text itself is wrong.

This is relevant to #266 which includes some clarification on how the SV information should be decoded, but which doesn't address this particular point.

Perhaps @cyenyxe or @lbergelson or @thefferon, who appear to be active in this area of the spec, have an opinion?

d-cameron commented 6 years ago

I was under the impression that for ins events, the insertion could be before or after as it just a replacement of the base at that position with new bases. If you have a ref A and you replace it with ACC then the insertion is after and if you replace it with CCA then the isertion is before.

On Thu., 11 Oct. 2018, 9:07 am Adam Novak, notifications@github.com wrote:

I'm working on the symbolic structural variant input in vgteam/vg, and I need to answer this question too.

Right now, the semantics I am working from, which I am implementing in vgteam/vg#1929 https://github.com/vgteam/vg/pull/1929 are:

-

For INS, the POS base is anchoring and the insert happens after it. SVLEN, if present, will give the number of new bases. END, if present, will be equal to POS.

For DEL, the POS base is anchoring and not deleted. SVLEN, if present, will be negative and represent the number of removed bases. END, if present, will be the position of the last deleted base.

For INV, the POS base is inverted. SVLEN, if present, will be 0, because no bases are added or removed. END, if present, will be the position of the last inverted base.

The current spec version says:

If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String “”) then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.

This would imply that my handling for and is correct, but my handling is wrong, because that first base at POS should be anchoring and not part of the inversion. This would imply that converting an inversion from a multi-nucelotide substitution with a sequence in the ALT field to a symbolic representation would change its POS.

But there is also still this example inversion description in the breakends section:

http://i.imgur.com/vubBiCY.png

That clearly shows an inversion with a POS of 321682 describing an inversion where the first inverted base is base 321682.

Which is the correct semantics to implement for reading existing VCF files that use notation to describe inversions? I'm now leading towards the interpretation where you have to have a left anchoring base, even for

, because that's what the spec literally says; the example might just be wrong. The behavior would be undefined when you want to have a symbolic variant up against the left end of a contig, and I'd have to rewrite my implementation, but it makes more sense to say that the spec has an incorrect example than that the text itself is wrong. This is relevant to #266 which includes some clarification on how the SV information should be decoded, but which doesn't address this particular point. Perhaps @cyenyxe or @lbergelson or @thefferon , who appear to be active in this area of the spec, have an opinion? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub , or mute the thread .
ctsa commented 6 years ago

@d-cameron I think this is also true for deletions. The spec doesn't say anything about whether a prefix or suffix is used, only that the ref and alt fields cannot be empty, thus forcing the use of a prefix or suffix base in the REF field for insertions and ALT field for deletions.

@adamnovak I'm still interested in this issue, but as it's been over two years I can't jump on this very quickly. I know that in addition to the comments I made in opening this issue, we struggled to get an exact base position definition from the spec for many SV types when writing manta -- the spec leaves room for interpretation by +/- 1 base for POS and END in many of the SV types -- it would be good to lock this down at some point but I'd have to find my previous notes to bring up all the specific cases. A few comments I can make now without digging anything up:

  1. For insertions END is not the same as POS. For a large insertion it is common to see a small deletion at the insertion point.
  2. The vcf INV type is especially problematic because this symbolic allele is used, in practice, to refer to either an inverted breakend pair or a simple 4-breakend inversion in its entirety. Apart from simulated datasets the latter case isn't very common in real data -- inversions are often coupled with deletions and other variations, or we see cases like fold-back inversions -- for all of these cases the 4-breakend vcf INV convention is not useful. I've seen @d-cameron comment on this issue before, and I believe GRIDSS works around this entirely by going straight to BND records? -- we will actually be replacing all INV records with BNDs (as well?) with the upcoming Manta 1.5 release because it's been such a headache trying to manage the ambiguity of this record type. Representing all inversions with BNDs is verbose, but it is at least unambiguous and doesn't limit your ability to represent any breakpoint homology/deletion for the inversion. You might consider just not accepting INV input to vg.
adamnovak commented 6 years ago

Unfortunately not accepting INV input isn't an option; we're gearing up to do a project on microinversions, which means we need a way to feed in inversion records. But on the plus side we're going to be generating our own input VCFs; we just have to work out what the right way to express these simple, 4-breakend inversions is according to the spec.

I'm not really sure what an "inverted breakend pair" is, or how you would specify it with an INV record. Is it just the junction of one position on one strand to another position on the other strand (i.e. half an inversion)? I agree that BND records is probably the right way to represent that.

If all the non-BND SVs are supposed to be the same kind of thing (i.e. complete events), I think the simple 4-breakend inversion interpretation is the right one, even if it's relatively uncommon. You could try and give it an SVLEN that is nonzero and say that bases were added or removed, but it's not clear which bases you would be referring to.

From @edawson's examples in https://github.com/vgteam/vg/issues/1926#issuecomment-429330604 it looks like some callers use SVLEN to hold the inverted sequence length, instead of a length change:

Lumpy reports an SVLEN > 0 for inversions:

1 16840374 71 N . . SVTYPE=INV;STRANDS=++:2,--:4;SVLEN=382449;END=17222823;CIPOS=-217,117;CIEND=-10,0;CIPOS95=-1 ...

Manta reports an SVLEN > 0 as well:

1 47908159 MantaINV:4784:1:2:0:0:0 G . MinSomaticScore END=216684962;SVTYPE=INV;SVLEN=168776803;IMPRECISE;CIPOS=-169,170;CIEND=-316,317;INV5;SOMATIC;SOMATICSCORE=16 PR

So trying to get anything out of an INV SVLEN in an existing dataset looks like a bad idea.

As for interpreting an insertion with END different from POS, I can see how that would be able to be well-defined (the base at END would be the last one removed, just as for a DEL), but it would help if the spec had some examples.

edawson commented 6 years ago

Apart from simulated datasets the latter case isn't very common in real data

While generally true, there are cases (e.g. RET-PTC gene fusions in thyroid cancer) where simple 4-break inversions are common and of high interest. While the BND notation allows expressing both simple and complex/compound inversion variants in the same format, it is not particularly human readable and I find it kind of beastly to work with when a simple DEL/INV/INS would do instead.

Another option would be to use a non-standard SV tag, as Broad's SvABA and dRanger tools do with the SPAN tag. Would this be a permissible way around the current spec issue?

d-cameron commented 6 years ago

@ctsa Yes, GRIDSS reports all events in breakend format.

I've been pushing to get a rework of the SV INFO fields in VCF for a while now, and put forward some proposals for VCF 4.3 but all but the clarification changes got pushed to a putative v4.4.

My biggest issue is that there is currently no separation or distinction between SV identification and SV classification. If a caller makes a DEL call, there is no way to tell whether the caller is claiming that there is a copy number loss in the deletion region (e.g. from micro-array or copy number caller), there is an adjacency between the start and end coordinates (e.g. breakpoint callers), or both (e.g. lumpy, but only sometimes). Attempting to reconstruct complex rearrangements is effectively impossible without knowing what the caller means by a call and that's not in the current standard.

Similarly, there is also no standard way to link multiple records into a larger event (my proposal repurposed the EVENT field for this, as well as adding some additional ones). For example there's currently no standard way to say that these adjacent intronic deletions are not deletions in the gene itself, but are part of a retrocopied process transcription insertion or a processed pseudogene homolog missing from the reference.

@edawson INV is particularly problematic as we have callers that call an INV based on only a single breakpoint. If you're lucky, the caller will use a non-standard field to indicate that the call is really only half a call, but different callers use different fields for these so it's in no way standardised.

BND format is indeed somewhat terrible but the single-record symbolic alleles DEL/INS/INV are way to limiting and you can't do any meaningful base-pair-accurate SV work using them. Sure, you'll occasionally get perfectly clean events, but you'll also get a lot of events with untemplated insertion sequence, or duplication or loss at one or both sides of an inversion. Your 'simple 4-break inversions' probably are rarely perfectly clean on both sides. You'll find you have to use BND notation for a non-trivial percentage of your variants and once you're doing this, it's easier to just use BND for everything and wishing you had some way to say that this set of BNDs is part of a 'simple' inversion event.

In any case, it sounds like some clarifications need to go into #266.

glennhickey commented 5 years ago

I stumbled on this too while using vg, and hope to see a clarification in 4.4. I understand the points in this thread about inversions sometimes being semi-defined or combined with other variation in practice, but I don't think that's relevant to the central issue: the spec should be clear bout (the arbitrary, as far as I'm concernted) choice of whether they start at POS or POS+1.

eroller commented 5 years ago

There is also this contradictory line in the spec. It doesn't mention symbolic alleles at all:

For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate.

https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L541

cyenyxe commented 5 years ago

@ctsa @adamnovak sorry it has taken so long to go back to this issue. What would you think of the following rewording? (forget the formatting, Markdown and TeX are not very compatible)

If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String $<$ID$>$'') then the padding base needs to be specified in REF, and POS denotes the coordinate of said base. This includes reference-only blocks, denoted as$<$*$>$'').

@ctsa mentioned that the de facto definition for gVCF is [POS, END], but I don't see any indications of that in the VCF specification, neither in text or in the examples. Given that the REF column follows the same rules as in any other symbolic allele, I am keen on making ALT follow the same rules too.

Unfortunately @adamnovak's concerns will still apply:

This would imply that converting an inversion from a multi-nucelotide substitution with a sequence in the ALT field to a symbolic representation would change its POS.

But addressing them at this point would involve severely changing the spec and the meaning of many VCF files. At least the behavior will be consistent within a single representation... And yeah, I think the example in the breakend representation section is wrong 😕 Once we agree on a solution for POS, I will change those accordingly.

jmarshall commented 5 years ago

I've never understood how symbolic alleles are supposed to be used. If the T in that example is supposed to be a padding base, how come it's not repeated in ALT?

REF   ALT
T     T<INV>
cyenyxe commented 5 years ago

Yeah, it is unfortunate that their representation is completely different from explicit sequences. Maybe instead of calling it the "padding base" when talking about symbolic alleles, we could use "preceding"?

pd3 commented 5 years ago

Symbolic alleles generally describe large events and often with imprecise boundaries. The position is often not known, the padding base wouldn't add any information. But we still want to be able to anchor it somewhere and the REF allele serves as a sanity check that the expected reference build is used.

lbergelson commented 5 years ago

@cyenyxe <*> is currently interpreted as being at POS rather than POS+1 (at least in GATK + friends) and changing it's meaning would be a nightmare at this point since it would change the meaning of every existing gvcf. It may be a sufficiently special that we could define pos for all symbolic alleles EXCEPT <*>.

There are other single base symbolic alleles though that I assume most people treat as being at POS. For instance IUPAC alt alleles are represented as <R> and my assumption would be that they do not have a padding base since they're essentially a SNP. I've never seen someone use one though so I could be mistaken though.