samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
636 stars 174 forks source link

Question about VCF 4.2: CIPOS, CIEND #132

Open heinzstockinger opened 8 years ago

heinzstockinger commented 8 years ago

In VCF 4.2, p. 11 http://samtools.github.io/hts-specs/VCFv4.2.pdf

there is the following example:

CHROM POS ID REF ALT … INFO FORMAT NA000001

3 12665100 . A … SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500 GT:GQ:CN:CNQ ./.:0:3:16.2

I'm not sure how CIPOS and CIEND are really defined/used. I understand that CI is a "Confidence Interval", so I assume that "CIPOS=-500,500" means -/+ 500 bases? such as

POS 12665100 +/-500, END 12686200 +/-500

This would mean, that the duplication is in the following range:

[ [12 664 600,12 665 600] - [12 685 700,12 686 700] ]

does that sound correct to you?

Thank you for your help in advance, Heinz

d-cameron commented 8 years ago

That was my impression, with the SV callers I've tested also appearing to match that definition. The headers indicate that the interval is defined as the 95% confidence interval so there could still be a chance that the variant falls outside that window. Unfortunately, there doesn't seem to be any separation between a position uncertainty distribution due to incomplete support (eg: from discordant read pairs), and an exact uncertainty interval due to sequence homology. I've been using the IMPRECISE flag to distinguish between the two intervals but there's nothing in the spec explicitly stating this interpretation is correct.

Additionally, as there's no standard equivalent of CIEND for breakend notation, I've been using a non-standard field (CIRPOS) as an equivalent substitute, although I guess I could have just been overloading CIEND.

Cheers Daniel

On 26/02/2016 5:32 PM, heinzstockinger wrote:

I'm not sure how CIPOS and CIEND are really defined/usered. I understand that CI is a "Confidence Interval", so I assume that "CIPOS=-500,500" means -/+ 500 bases? such as

heinzstockinger commented 8 years ago

Thanks very much for your clarifications - seems we have the same interpretation. Might be good to add that example + interpretation to the spec since others might have the same question.

One minor question concerning the 95% confidence interval that you mention above: is that a default assumption (would be ok) or is that something encoded in the example but I just don't see it?

Thanks, Heinz

d-cameron commented 8 years ago

is that something encoded in the example but I just don't see it?

Having rechecked the spec, no the 95% is not in there and I stand mistaken. Looking over my SV caller comparisons, the 95% I was thinking of comes from the output of lumpy-sv which reports the full interval using the CIPOS field and the 95% interval in a (non-standard) CIPOS95 field.

On 26/02/2016 6:53 PM, heinzstockinger wrote:

Thanks very much for your clarifications - seems we have the same interpretation. Might be good to add that example + interpretation to the spec since others might have the same question.

One minor question concerning the 95% confidence interval that you mention above: is that a default assumption (would be ok) or is that something encoded in the example but I just don't see it?

Thanks, Heinz

— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/132#issuecomment-189152121.

heinzstockinger commented 8 years ago

ok, thank you, Daniel. Cheers, Heinz

thefferon commented 8 years ago

To revisit this topic – Which of the following solutions seems preferable, to distinguish between the 'full interval' and the 95% confidence interval:

  1. Per lumpy-sv, use non-standard CIPOS95 to indicate 95% CI and the standard CIPOS to indicate the full interval. Note, this would imply the standard CIPOS should always be interpreted as indicating the full interval; however it is by definition a 'confidence interval' and so will most often be interpreted as one. A reasonable assumption is that it represents a 95% CI, in the absence of anything explicitly indicating otherwise.
  2. Add to the VCF spec an INFO tag 'CIVALUE=', where one could specify the confidence value that applies to CIPOS. E.g., "CIVALUE=0.95" would indicate that CIPOS and CIEND represent 95% confidence intervals, while "CIVALUE=1.0" would indicate that CIPOS and CIEND represent the full intervals. Per the reasoning in #1 above, if CIPOS and CIEND were present without a stated CIVALUE, then CIPOS and CIEND would be assumed to represent 95% confidence intervals.

Whether one of these solutions is best, or another altogether, something should be added to VCFv4.3 to encourage a common usage across VCF.

Thoughts?

Thanks, Tim

d-cameron commented 8 years ago

I think that being able to identify the source of the confidence interval would be a useful addition. CIPOS can be due either to a) inexact breakend identification or b) homology at the breakend. Currently this is done indirectly through the presence of the IMPRECISE tag. Is there currently a need to differentiate between a 'full interval' and a 95%? I strongly prefer 1) as it is backward compatible, and doesn't break exact callers by changing the default interpretation of CIPOS to a meaningless value.

d-cameron commented 8 years ago

CIVALUE is also problematic in that it is not possible to specify both the full and 95% confidence intervals for a variant.

heinzstockinger commented 8 years ago

I don't have a specific preference for 1 or 2 (maybe 1 is preferable, following Daniel's arguments). I think it's important to clearly document how it is interpreted now and provide examples to avoid misinterpretation.

thefferon commented 7 years ago

Let's try to get this resolved. The central questions seem to be:

  1. Is there a need to capture and clearly distinguish between the concepts of "full" interval and "confidence" interval?
  2. What is the best way to do so, that accomplishes what is needed without breaking backward compatibility, nor forcing a change in the way existing terms are commonly understood and used?

Re #1: A "full interval" may be defined as that interval outside of which one is certain a variant's breakpoint does not lie. In other words, its coordinates may be considered "hard boundaries", i.e. two nucleotide positions between which the breakpoint must lie. This is to be contrasted with the "confidence interval", which can be defined as an interval within which there is a defined, statistical level of assuredness that a breakpoint lies – however, by definition, there is a complimentary probability that the breakpoint lies outside the defined interval.

As discussed above, breakpoint ambiguity can be due to:

a. inexact breakend identification b. homology at the breakend

The current VCF spec includes a mechanism to capture 'b': the HOMLEN and HOMSEQ INFO tags. Since these tags are only used when one can define them precisely, they can be used without recourse to CIPOS and CIEND, or any other mechanism aimed at capturing breakpoint uncertainty (in fact to do so would be redundant). With respect to 'a', structural variation data usually fall into either the "full interval" or the "confidence interval" concepts, as described above. If one needs an example of when a full interval is warranted, consider optical mapping data (e.g., BioNano), which has definite, basepair-resolution boundaries between which a detected variant's breakpoint is certain to fall. This is a different sort of data description than the standard confidence interval.

This is getting long and I haven't yet addressed the alternative solutions for "full" versus "confidence" intervals, but I will post it for your consideration and feedback.

Meanwhile, with respect to the IMPRECISE tag: @d-cameron earlier described one use of the tag; however that is not necessarily the way others use it. Unfortunately, the spec is unclear; it does not define what is meant by an "imprecise" variant; nor does it state in precisely what fashion the IMPRECISE tag is to be correctly used. In my experience the tag is redundant: If one is unsure of a variant's breakpoints, one uses CIPOS and CIEND to give an approximation of their locations; it is redundant to also use the IMPRECISE tag. In fact, the extended example VCF for structural variation (p.15 of my copy of v4.3) describes several imprecise variants without resorting to using the IMPRECISE tag. I see this as an argument for interpreting IMPRECISE as being redundant with CIPOS/CIEND.

thefferon commented 7 years ago

If one accepts my argument that it is necessary to distinguish between full and confidence intervals, then the question becomes how best to do so.

As already noted on this ticket, the VCF spec is not explicit about defining CIPOS and CIEND. Without further guidance, I think a reasonable interpretation is that "confidence interval" means the more-or-less standard 95% confidence interval.

Perhaps someone who has more direct experience than I do using the multitude of SV callers out there may be able to weigh in here, on how CIPOS and CIEND are computed in various situations. The only caller specifically mentioned in this ticket is lumpy, which according to the description above uses CIPOS and CIEND to indicate full intervals, and a non-standard CIPOS95 (and presumably CIEND95) tag to indicate 95% confidence intervals. This usage is, as far as I know, unique; and while it represents one plausible solution, to adopt it as standard would imply assuming that the rest of the SV community has always used CIPOS and CIEND in the same way – which is far from obvious, and in fact may be exactly counter to the more obvious interpretation (of the 95% confidence interval).

thefferon commented 7 years ago

If we can agree that the majority of the community interprets CIPOS and CIEND as 95% confidence intervals, then that is how they should continue to be used, and the spec needs to be updated to be explicit about it.

What would then remain is to define and agree upon a solution for expressing the full interval. Earlier I suggested adding a reserved INFO tag "CIVALUE" (=Float), which one could use in addition to CIPOS and CIEND to express the level of confidence when it is other than the default (95%). This solution is the converse of lumpy's, in that it posits the default confidence level to be 95%, rather than 100%.

Another solution could be an INFO flag "CI100" (or "CI1.0") that, when present, changed the interpretation of CIPOS and CIEND from the default 95% to 100%.

@d-cameron, earlier you said:

CIVALUE is also problematic in that it is not possible to specify both the full and 95% confidence intervals for a variant.

Can you provide an example (other than lumpy) when this might occur? Unless there is a compelling argument, I think this is proabably the rare exception that is outweighed by other arguments above.

thefferon commented 7 years ago

@d-cameron, @heinzstockinger, @cyenyxe,

Please weigh in if you have input on this issue.

Thanks

d-cameron commented 7 years ago

I'm my GRIDSS output, I use CIPOS as both: a. inexact breakend identification (technically, I'm using 99.5% bounds, not 95%) b. homology at the breakend

the reason this is done is that, although the HOMLEN and HOMSEQ fields encode the homology length, they don't specify whether the homology is with respect to the reference or the sample (eg nearby SNV), nor can you easily compute breakend bounds from these fields.

HOMLEN/HOMSEQ don't specify where the homology is with respect to the called position and the calculation of that is decidedly non-trivial (especially if you want to do it properly and have to considered nearby phased SNVs/indels/other SVs). I use CIPOS (and CIRPOS instead of CIEND as I report in break-end notation to encode the actual homology bounds.

If CIPOS/CIEND are to be redefined as 95% confidence intervals, this begs the question as to what a 100% confidence interval is? Library insert size distributions have very long tails. I don't think anyone would actually seriously considering writing a 10,000bp interval because 5 reads out of a billion come from some large fragments.

Is there a need to capture and clearly distinguish between the concepts of "full" interval and "confidence" interval?

I'm inclined to say no. A "confidence" interval interpretation be inferred from the IMPRECISE flag, although even then the actual bounds reported can be exact, even if the length of the variant is imprecise. In terms of actual downstream usage, the CIPOS/CIEND values will either be taken as-is and incorporated into the call bounds (as my StructuralVariantAnnotation package does), or just ignored anyway.

What is the best way to do so, that accomplishes what is needed without breaking backward compatibility, nor forcing a change in the way existing terms are commonly understood and used?

I definitely prefer CIPOS95.

thefferon commented 7 years ago

@d-cameron, thanks for your reply.

Would you please add a 'vcf' label to this issue? I am not able to.

I will respond more fully to all your points soon, but for now would you support writing into the VCF spec an interpretation of CI-POS/-END that specifies:

The above interpretation has the added advantage that it avoids (at least when expressing the full interval) assigning to POS a "made-up" value, i.e. one's "best estimate" of the breakpoint's location. I don't believe "guesstimated" values should be used for data points as central to the VCF model as is POS.

d-cameron commented 7 years ago

There are a couple of issues that I feel make that proposal unviable:

There's already an IMPRECISE flag, wouldn't it make sense just to add other flags to handle these interpretations?

The above interpretation has the added advantage that it avoids (at least when expressing the full interval) assigning to POS a "made-up" value, i.e. one's "best estimate" of the breakpoint's location. I don't believe "guesstimated" values should be used for data points as central to the VCF model as is POS.

Why is that? If all SV callers reported the most likely POS, then the difference in position between caller would be minimal. Isn't this a good thing? Left/right aligning makes sense when you know the event sequence and can call in a single VCF record (for some events in breakend notation, left aligning one breakend forces the other end to be right aligned), but when you don't know the exact sequence, doesn't it make sense to call the POS that is mostly likely to be correct?

d-cameron commented 7 years ago

Also:

In the above example, the number of As in the variant sequence is not known exactly. It is inferred from the library fragment size distribution. The left breakend has exact bounds because the break is to polyA, the right bound of the right breakend is also exact, but the left bound is inferred from spanning read pairs and is thus inexact.