Open heinzstockinger opened 8 years ago
That was my impression, with the SV callers I've tested also appearing to match that definition. The headers indicate that the interval is defined as the 95% confidence interval so there could still be a chance that the variant falls outside that window. Unfortunately, there doesn't seem to be any separation between a position uncertainty distribution due to incomplete support (eg: from discordant read pairs), and an exact uncertainty interval due to sequence homology. I've been using the IMPRECISE flag to distinguish between the two intervals but there's nothing in the spec explicitly stating this interpretation is correct.
Additionally, as there's no standard equivalent of CIEND for breakend notation, I've been using a non-standard field (CIRPOS) as an equivalent substitute, although I guess I could have just been overloading CIEND.
Cheers Daniel
On 26/02/2016 5:32 PM, heinzstockinger wrote:
I'm not sure how CIPOS and CIEND are really defined/usered. I understand that CI is a "Confidence Interval", so I assume that "CIPOS=-500,500" means -/+ 500 bases? such as
Thanks very much for your clarifications - seems we have the same interpretation. Might be good to add that example + interpretation to the spec since others might have the same question.
One minor question concerning the 95% confidence interval that you mention above: is that a default assumption (would be ok) or is that something encoded in the example but I just don't see it?
Thanks, Heinz
is that something encoded in the example but I just don't see it?
Having rechecked the spec, no the 95% is not in there and I stand mistaken. Looking over my SV caller comparisons, the 95% I was thinking of comes from the output of lumpy-sv which reports the full interval using the CIPOS field and the 95% interval in a (non-standard) CIPOS95 field.
On 26/02/2016 6:53 PM, heinzstockinger wrote:
Thanks very much for your clarifications - seems we have the same interpretation. Might be good to add that example + interpretation to the spec since others might have the same question.
One minor question concerning the 95% confidence interval that you mention above: is that a default assumption (would be ok) or is that something encoded in the example but I just don't see it?
Thanks, Heinz
— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/132#issuecomment-189152121.
ok, thank you, Daniel. Cheers, Heinz
To revisit this topic – Which of the following solutions seems preferable, to distinguish between the 'full interval' and the 95% confidence interval:
Whether one of these solutions is best, or another altogether, something should be added to VCFv4.3 to encourage a common usage across VCF.
Thoughts?
Thanks, Tim
I think that being able to identify the source of the confidence interval would be a useful addition. CIPOS can be due either to a) inexact breakend identification or b) homology at the breakend. Currently this is done indirectly through the presence of the IMPRECISE tag. Is there currently a need to differentiate between a 'full interval' and a 95%? I strongly prefer 1) as it is backward compatible, and doesn't break exact callers by changing the default interpretation of CIPOS to a meaningless value.
CIVALUE is also problematic in that it is not possible to specify both the full and 95% confidence intervals for a variant.
I don't have a specific preference for 1 or 2 (maybe 1 is preferable, following Daniel's arguments). I think it's important to clearly document how it is interpreted now and provide examples to avoid misinterpretation.
Let's try to get this resolved. The central questions seem to be:
Re #1: A "full interval" may be defined as that interval outside of which one is certain a variant's breakpoint does not lie. In other words, its coordinates may be considered "hard boundaries", i.e. two nucleotide positions between which the breakpoint must lie. This is to be contrasted with the "confidence interval", which can be defined as an interval within which there is a defined, statistical level of assuredness that a breakpoint lies – however, by definition, there is a complimentary probability that the breakpoint lies outside the defined interval.
As discussed above, breakpoint ambiguity can be due to:
a. inexact breakend identification b. homology at the breakend
The current VCF spec includes a mechanism to capture 'b': the HOMLEN and HOMSEQ INFO tags. Since these tags are only used when one can define them precisely, they can be used without recourse to CIPOS and CIEND, or any other mechanism aimed at capturing breakpoint uncertainty (in fact to do so would be redundant). With respect to 'a', structural variation data usually fall into either the "full interval" or the "confidence interval" concepts, as described above. If one needs an example of when a full interval is warranted, consider optical mapping data (e.g., BioNano), which has definite, basepair-resolution boundaries between which a detected variant's breakpoint is certain to fall. This is a different sort of data description than the standard confidence interval.
This is getting long and I haven't yet addressed the alternative solutions for "full" versus "confidence" intervals, but I will post it for your consideration and feedback.
Meanwhile, with respect to the IMPRECISE tag: @d-cameron earlier described one use of the tag; however that is not necessarily the way others use it. Unfortunately, the spec is unclear; it does not define what is meant by an "imprecise" variant; nor does it state in precisely what fashion the IMPRECISE tag is to be correctly used. In my experience the tag is redundant: If one is unsure of a variant's breakpoints, one uses CIPOS and CIEND to give an approximation of their locations; it is redundant to also use the IMPRECISE tag. In fact, the extended example VCF for structural variation (p.15 of my copy of v4.3) describes several imprecise variants without resorting to using the IMPRECISE tag. I see this as an argument for interpreting IMPRECISE as being redundant with CIPOS/CIEND.
If one accepts my argument that it is necessary to distinguish between full and confidence intervals, then the question becomes how best to do so.
As already noted on this ticket, the VCF spec is not explicit about defining CIPOS and CIEND. Without further guidance, I think a reasonable interpretation is that "confidence interval" means the more-or-less standard 95% confidence interval.
Perhaps someone who has more direct experience than I do using the multitude of SV callers out there may be able to weigh in here, on how CIPOS and CIEND are computed in various situations. The only caller specifically mentioned in this ticket is lumpy, which according to the description above uses CIPOS and CIEND to indicate full intervals, and a non-standard CIPOS95 (and presumably CIEND95) tag to indicate 95% confidence intervals. This usage is, as far as I know, unique; and while it represents one plausible solution, to adopt it as standard would imply assuming that the rest of the SV community has always used CIPOS and CIEND in the same way – which is far from obvious, and in fact may be exactly counter to the more obvious interpretation (of the 95% confidence interval).
If we can agree that the majority of the community interprets CIPOS and CIEND as 95% confidence intervals, then that is how they should continue to be used, and the spec needs to be updated to be explicit about it.
What would then remain is to define and agree upon a solution for expressing the full interval. Earlier I suggested adding a reserved INFO tag "CIVALUE" (=Float), which one could use in addition to CIPOS and CIEND to express the level of confidence when it is other than the default (95%). This solution is the converse of lumpy's, in that it posits the default confidence level to be 95%, rather than 100%.
Another solution could be an INFO flag "CI100" (or "CI1.0") that, when present, changed the interpretation of CIPOS and CIEND from the default 95% to 100%.
@d-cameron, earlier you said:
CIVALUE is also problematic in that it is not possible to specify both the full and 95% confidence intervals for a variant.
Can you provide an example (other than lumpy) when this might occur? Unless there is a compelling argument, I think this is proabably the rare exception that is outweighed by other arguments above.
@d-cameron, @heinzstockinger, @cyenyxe,
Please weigh in if you have input on this issue.
Thanks
I'm my GRIDSS output, I use CIPOS as both: a. inexact breakend identification (technically, I'm using 99.5% bounds, not 95%) b. homology at the breakend
the reason this is done is that, although the HOMLEN and HOMSEQ fields encode the homology length, they don't specify whether the homology is with respect to the reference or the sample (eg nearby SNV), nor can you easily compute breakend bounds from these fields.
HOMLEN/HOMSEQ don't specify where the homology is with respect to the called position and the calculation of that is decidedly non-trivial (especially if you want to do it properly and have to considered nearby phased SNVs/indels/other SVs). I use CIPOS (and CIRPOS instead of CIEND as I report in break-end notation to encode the actual homology bounds.
If CIPOS/CIEND are to be redefined as 95% confidence intervals, this begs the question as to what a 100% confidence interval is? Library insert size distributions have very long tails. I don't think anyone would actually seriously considering writing a 10,000bp interval because 5 reads out of a billion come from some large fragments.
Is there a need to capture and clearly distinguish between the concepts of "full" interval and "confidence" interval?
I'm inclined to say no. A "confidence" interval interpretation be inferred from the IMPRECISE flag, although even then the actual bounds reported can be exact, even if the length of the variant is imprecise. In terms of actual downstream usage, the CIPOS/CIEND values will either be taken as-is and incorporated into the call bounds (as my StructuralVariantAnnotation package does), or just ignored anyway.
What is the best way to do so, that accomplishes what is needed without breaking backward compatibility, nor forcing a change in the way existing terms are commonly understood and used?
I definitely prefer CIPOS95.
@d-cameron, thanks for your reply.
Would you please add a 'vcf' label to this issue? I am not able to.
I will respond more fully to all your points soon, but for now would you support writing into the VCF spec an interpretation of CI-POS/-END that specifies:
The above interpretation has the added advantage that it avoids (at least when expressing the full interval) assigning to POS a "made-up" value, i.e. one's "best estimate" of the breakpoint's location. I don't believe "guesstimated" values should be used for data points as central to the VCF model as is POS.
There are a couple of issues that I feel make that proposal unviable:
There's already an IMPRECISE flag, wouldn't it make sense just to add other flags to handle these interpretations?
The above interpretation has the added advantage that it avoids (at least when expressing the full interval) assigning to POS a "made-up" value, i.e. one's "best estimate" of the breakpoint's location. I don't believe "guesstimated" values should be used for data points as central to the VCF model as is POS.
Why is that? If all SV callers reported the most likely POS, then the difference in position between caller would be minimal. Isn't this a good thing? Left/right aligning makes sense when you know the event sequence and can call in a single VCF record (for some events in breakend notation, left aligning one breakend forces the other end to be right aligned), but when you don't know the exact sequence, doesn't it make sense to call the POS that is mostly likely to be correct?
Also:
TTTTTAAAAAAAAAATTTTTTT ..............AAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCC
^ ^ ?????^????? ^
exact exact ----- to ----> inexact exact
In the above example, the number of As in the variant sequence is not known exactly. It is inferred from the library fragment size distribution. The left breakend has exact bounds because the break is to polyA, the right bound of the right breakend is also exact, but the left bound is inferred from spanning read pairs and is thus inexact.
In VCF 4.2, p. 11 http://samtools.github.io/hts-specs/VCFv4.2.pdf
there is the following example:
CHROM POS ID REF ALT … INFO FORMAT NA000001
3 12665100 . A … SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500 GT:GQ:CN:CNQ ./.:0:3:16.2
I'm not sure how CIPOS and CIEND are really defined/used. I understand that CI is a "Confidence Interval", so I assume that "CIPOS=-500,500" means -/+ 500 bases? such as
POS 12665100 +/-500, END 12686200 +/-500
This would mean, that the duplication is in the following range:
[ [12 664 600,12 665 600] - [12 685 700,12 686 700] ]
does that sound correct to you?
Thank you for your help in advance, Heinz