Open tfenne opened 5 years ago
Yes, the intention was to allow either *
or [ACGTN]+
and the combination should be prohibited.
Some related discussion here that I missed previously: https://github.com/samtools/hts-specs/issues/151
And an issue/question in Octopus which uses *
followed by bases to represent partially spanned alleles: https://github.com/luntergroup/octopus/issues/75
@lbergelson, @pd3 & @jmmut pinging you guys since you are the currently listed VCF maintainers. Is there a path to making decision here in the near future? It's becoming a little awkward since Octopus uses alleles of \**[ACGT]+
and currently HTSJDK cannot read VCFs containing those alleles, which makes it a) impossible to use Picard of GATK tools with Octopus VCFs and b) near-impossible to write custom code that uses HTSJDK to process Octopus VCFs (see https://github.com/samtools/htsjdk/pull/1413).
My sense is that since this isn't explicitly dis-allowed by the spec and that @dancooke has shown a good use case for alleles that start with *
and continue with bases, and that it's never been disallowed by the specification, we should probably just formalize that it's ok and give examples of how it can be used to represent partially-spanned alleles. But the worst case is to have uses like this in the wild and reference implementations that don't want to support that syntax even though it's not in violation of the spec.
If folks are on board with formalizing the usage from octopus I'm happy to put together a PR to the spec. And if that's a non-starter could someone else make a PR to clarify the restrictions on use of *
please?
@tfenne I'm not totally sold on this. It's not explicitly disallowed by the spec, but it was definitely not intended.
I see how it's useful for figuring out a partially spanned allele, but it seems like it could add a lot of complication to an already error prone and poorly supported feature.
Is the intent to support alleles starting with *
or do you want to allow *
at any position in the read? We could potentially have very complicated alleles overlapping multiple things specified at other points in the vcf. It seems like supporting this will open up another set of equivalent ways to represent the same haplotypes in a vcf.
@lbergelson I would argue that the 'symbolic' interpretation adds even more complexity as it adds a completely distinct concept. The base-*
interpretation simply extends the set of allowed ALT bases by one, essentially meaning "already reported upstream".
I think the definition I gave implicitly requires any *
to be a the head of the allele - otherwise the allele that is causing the non-head-*
would be reported in a downstream record, contradicting the definition.
@lbergelson What I really want is for the spec to be 100% clear on what is and isn't supported, and for HTSJDK to be in line with that, and for Octopus to generate VCFs that are spec compliant. I tend to think clarifying the spec to allow one or more *
s followed by 0 or more bases is a good approach as it doesn't significantly complicate what's there now, and it doesn't penalize @dancooke for implementing a feature in a spec-compliant (albeit underspecified) way. But mostly I'm just looking for alignment between the spec, Octopus and HTSJDK.
I prefer to disallow an ALT like **AAC
. Due to the way VCF is designed, we often need to look at the previous records to make the right decision at the current record. Having **AAC
doesn't solve this problem and further introduces additional conceptual and technical complexity, and more ways to generate inconsistent VCFs.
There is actually a more fatal problem with the symbolic *
. Considering again the example given by @tfenne:
# Symbolic * representation
chr1 1000 GTATATA G ... 0|1
chr1 1005 TAGCGA T,* ... 1|2
# Base * representation
chr1 1000 GTATATA G ... 0|1
chr1 1005 TAGCGA T,**GCGA ... 1|2
Both of these examples are actually inconsistent as the first record states the reference (A
) on the first haplotype at chr1:1006
while the second states that this base is deleted. Octopus currently deals with this by using .
:
chr1 1000 GTATATA G ... .|1
chr1 1005 TAGCGA T,**GCGA ... 1|2
But this isn't particularly satisfactory as it doesn't make a strong claim about the reference between chr1:1000-1004
. I don't see a way around this using the symbolic representation. However, it's fairly simple to extend the definition of base-*
to something like:
This base is stated by an overlapping record.
Then the output would be
chr1 1000 GTATATA G,GTATA** ... 2|1
chr1 1005 TAGCGA T,**GCGA ... 1|2
Which is consistent and makes it simple to reconstruct haplotypes as there no need to remember position information between overlapping records.
The haplotype alignment is:
Pos: 1234567890123456789012345
Ref: XXXXXXXXXGTATATAGCGAXXXXX
H1: XXXXXXXXXGTATAT-----XXXXX
H2: XXXXXXXXXG------GCGAXXXXX
Your final VCF is wrong. The first line should be:
chr1 1000 GTATATA G,GTATAT* ... 2|1
A VCF parser can only detect this error when it comes to the second line. To validate a VCF, you always need to look at multiple lines. Your proposal is not making things simpler. Furthermore, your proposal allows inconsistencies between the two lines. Wrong second lines could look like:
chr1 1005 TAGCGA T,**GCA ... 1|2
or
chr1 1005 TAGCGA T,****CA ... 1|2
A proper design should try to minimize such errors.
The right way to represent this alignment is:
chr1 1000 GTATATA G,GTATAT 2|1
chr1 1005 TAGCGA T,* 1|2
It is easy enough to reconstruct the alignment. There are fewer ways to create wrong VCFs.
@lh3
The first line should be:
chr1 1000 GTATATA G,GTATAT* ... 2|1
Nope. The T
before the *
is (and must be) determined by the next record.
The right way to represent this alignment is:
chr1 1000 GTATATA G,GTATAT 2|1 chr1 1005 TAGCGA T,* 1|2
Completely disagree. You're permitting duplicate information in one direction but not the other. Not only is this inconsistent, it's also very confusing - It looks like the first record is calling a 1bp deletion, but you don't know that this isn't the case until you see the next record, and the first gives no indication you must do this (unlike the way I proposed).
There are fewer ways to create wrong VCFs.
I think this is a poor argument. For a start, the number of tools generating VCFs de novo is small compared with the number that parse/manipulate them, and I'd strongly argue that the base *
representation makes parsing VCFs less error prone than symbolic *
(or your proposed "right way") as *
has clear meaning - it essentially says "ignore this base". Moreover, if you're generating inconsistent VCFs with this representation then probably you have some deeper problem with your algorithms internals. I'd rather have an output format that forces me to address this rather than allowing me to sweep it under the rug.
Would also be good to hear @Lenbok's thoughts on this - he added support for base *
to RTG Tools vcfeval.
Nope. The T before the * is (and must be) determined by the next record.
Then we are interpreting your *
differently (I see your *
as padding). This complicates an already complex feature.
It is worth mentioning the freebayes way to encode this VCF. See a more complex example:
Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1: XXXXXXXXXGTATAT------XXXXX
H2: XXXXXXXXXG------GCTGAXXXXX
The freebayes way is to use CIGAR for a series of interlocking variants:
chr1 10 GTATATAGCGA GTATAT,GGCTGA CIGAR=6M5D,1M6D2M1I2M
This is the best way to describe both haplotypes at the same time. On this example, your attempt to describe both haplotypes will become more complex. I guess there should be ways with your method, but I can't think of one for now. With the current *
, we can describe one allele on one haplotype:
chr1 10 GTATATA G,GTATAT 2|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
PS: the major problem with the freebayes method is that it becomes awkward when there are a 6kb deletion on one haplotype and multiple small variants on the other haplotype. How do you deal with that in your representation?
The more I think of this problem, the stronger I feel we should forbid **AC
.
Another issue to take into account is that the current spec doesn't mention that the record where the overlapping deletion is described needs to appear before the current record, so would this be a legal VCF?
chr1 10 GTATATA G,* 2|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
This has less redundancy and looks more straight to the point, but of course makes haplotype reconstruction harder by having to look at later records.
As long as the spec is clear and easy to understand I'm ok with any approach. At the very least I think we should address the current wordings:
If we instead adopted the partial spanning notation, it should make clear the meaning of *
. From this (to me, surprising) sentence:
The T before the * is (and must be) determined by the next record.
What I understand is that the purpose of each *
is there to avoid redundancy, i.e. avoid putting a base that is described in other records, which is yet another different meaning of the already too overloaded *
. Maybe I'm wrong, but I don't see any fatal issue with the current notation that the partial spanning notation solves. The partial spanning one may be clearer on a few examples, but the added complexity makes it likely there will be more confusion about it in most cases.
I'm slightly in favour of just fixing the wording of the current simple approach.
The definition I'm suggesting is:
`*` indicates that this reference base is specified by an overlapping record
Note that this does't only apply to deletions, or upstream records (as currently required), in addition, there is no requirement to introduce a new type of 'symbolic' allele.
Using this definition, your example would be reported as:
chr1 10 GTATATA G,GTATA** 2|1
chr1 15 TAGCGA T,**G*GA 1|2
chr1 18 C CT,* 2|1
This has several advantages over the symbolic representation you show:
Just to stress that last point, let's look how you would represent your example without the first deletion:
chr1 15 TAGCGA T, TAGCTGA 1|2
chr1 18 C CT,* 2|1
So not only do you end up completely changing representation (going from symbolic *
to an explicit allele), you also end up duplicating the insertion and therefore using the FreeBayes representation anyway. Using my definition this becomes:
chr1 15 TAGCGA T, TAG*GA 1|2
chr1 18 C CT,* 2|1
Consistent with the full example. No duplicate information - and therefore less error prone. I really can't understand how you find this more complex or confusing?
You already point out the main reason that the FreeBayes representation isn't viable - it doesn't scale. It's also hard to read a complex cigar string, and some complex substitutions can't be properly represented with a cigar (e.g. micro-inversions).
I'd also suggest that it makes matching and annotating variants from variation databases harder since simple variation gets buried into long haplotypes. Consider one more example, a complex substitution and an SNV. The 'FreeBayes' way would be:
chr1 15 TAGCGA TCGCTA, TAGTGA 1|2
There's no way to 'decompose' this (without using .
) using the existing specification since *
only currently applies to deletions. It makes it far from obvious there's a SNV even in this small case. Using my suggested definition, this becomes:
chr1 15 TAGCGA TCGCTA, TAG*GA 1|2
chr1 18 C T,* 2|1
Now the SNV is clear, and can easily be annotated.
@jmmut I was thinking about this on a flight. Now for my example:
Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1: XXXXXXXXXGTATAT------XXXXX
H2: XXXXXXXXXG------GCTGAXXXXX
my preference is:
chr1 10 GTATATA G 0|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
Here 0
means the reference allele until the next record. If I am right, this is how we are actually using 0
in practice. @tfenne's original example follows the same logic. This representation describes one allele at a time. It is succinct and consistent.
[the current
*
representation] makes haplotype reconstruction harder by having to look at later records.
When there are overlapping variations, you anyway need to read multiple records to reconstruct the haplotypes. The key information base *
tells us is that there are overlapping records later on. You can trivially know this by reading the following records.
@lh3 I really don't like this. It's weird to have one rule for one direction and a completely different one for the other. You're suggesting entirely changing the definition of GT
and using a symbolic allele (*
), but I can still never be sure if a 0
in GT
actually means reference just from looking at the record - this will result in confusion and mistakes. It's not even consistent, my 'complex substitution' example would become
chr1 15 TAGCGA TCGCTA 1|0
chr1 18 C T 0|1
and we'd be right back to the situation without *
in the first place.
@jmmut This is at least consistent, but the advantages of base *
over symbolic *
remain:
*
must implicitly assume the reference (although this isn't clear from the specification) and forces me to look at other records and calculate overlaps.*
we easily end up having genotypes like chr1 18 C T,* 1|2|2
, making it look like the zygosity is 2, but where the two *
symbolic alleles actually refer to different overlapping alleles. This is unlikely to happen with base *
(we'd usually have 1|2|3
). The same problem for diploid could occur if joint calling multiple samples (e.g. chr1 18 C T,* 2|2
with *
referring to different alleles).The only downside (that I can think of) is that ALT
alleles are more verbose than with symbolic *
. I believe this is a price worth paying for the advantages. Note that is always possible to convert records from base *
to symbolic *
, but not the other way around, showing that you do loose information.
I believe what I described is just an elaboration of the current approach. It is simple to generate: describe each atomic allele individually. It is also simple to parse: reconstruct the haplotype alignment based on the current record, discard the part of the haplotype beyond the the next record and write a new allele. It is imperfect, but it gets the job done and is compatible with existing ecosystem.
Your complex substitution example is actually more about multi-allelic sites. For example:
chr1 10 A C,T 1|2
In bgt, I have long been using
chr1 10 A C,<M> 1|2
chr1 10 A T,<M> 2|1
Symbolic allele <M>
means a different allele. For deletions, it plays the same role as *
. If we like (not that I am proposing!), we can extend the definition of *
to substitutions. Then *
would be the same as <M>
in bgt.
On three haplotypes, we can use
chr1 15 A C,*,* 1|2|3
to indicate genotypes involving two overlapping deletions, if we really want. In practice, it doesn't matter either way. The parser can know the correct haplotypes based on the upstream records.
Several people in this thread thought base *
helps to reduce dependencies between records. Not quite. First, a competent VCF parser should validate across records and have to look at down/upstream anyway. Second, see this base *
example:
chr1 10 GTATATA G,GTATA** 2|1
chr1 15 TAGCGA T,**G*GA 1|2
chr1 18 C CT,* 2|1
At position 15, the parser needs pos 18 to construct the haplotype alignment, and when it comes to pos 18, it needs to memorize the record at pos 15. The is the result of the insertion. Furthermore, the redundancy across records is particularly concerning as it gives us more ways to create wrong records. I designed sam, bam, gfa and bcfv2. Redundancy like this is one of the first things I would strive to remove.
I strongly oppose any format (re)definition that requires any information that relies in any way on record ordering or for which the variant call for a single record cannot be determined by reading that record in isolation. Common operations that will break such a definition are:
There is definitely room for improvement in defining phasing information but breaking steps that are used in almost every downsteam processing pipeline is something I am strongly opposed to.
Somewhat of a pity we didn't get around to fixing this given it was raised this as a problem 5 years ago: https://github.com/samtools/hts-specs/issues/28
I agree with @d-cameron.
It seems that this discussion is about redefining VCF to incorporate a "graph style" representation of alleles and support breaking large haplotypes into simpler nested loci. If that is the intent, then let's carefully define the semantics AND document them in the specification. e.g. allowing overlapping records in VCF has been a nightmare for tool developers because everyone basically picks the semantics that makes the most sense to them at the time and assumes everyone else in the world should think about it the same way.
If we do wish to expand VCF semantics, then let's consider that most attempts at doing so have redefine either
Both of those redefinitions are incompatible with other interpretations of the standard. e.g. why can't '0' actually mean reference? e.g. GT of '1/.' has been used to indicate a "half-call" where ALT 1 is present at a diploid locus and a second allele cannot be determined. This is distinct from using the non-ref allele <*> because the uncalled allele may in fact be reference (just at an insufficient quality level to call).
@dancooke : Just as a data point, I understand that you consider the multi-allelic VCF representation as "unscalable", but I would probably be using Octopus now if it supported that VCF syntax. Having to re-do my entire clinical analysis pipeline for yet another experimental VCF dialect is a non-starter.
allowing overlapping records in VCF has been a nightmare for tool developers
Everyone is complaining about this, but no one finds a satisfactory way to eliminate overlapping records. The best attempt is freebayes' CIGAR, but that is close to being useless in the presence of long deletions.
everyone basically picks the semantics that makes the most sense to them at the time and assumes everyone else in the world should think about it the same way.
In my view, the solution to overlapping records is not to eliminate them, but to impose a consistent representation.
why can't '0' actually mean reference? e.g. GT of '1/.' has been used to indicate a "half-call" where ALT 1 is present at a diploid locus and a second allele cannot be determined.
Let's use my example again:
Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1: XXXXXXXXXGTATAT------XXXXX
H2: XXXXXXXXXG------GCTGAXXXXX
Since the introduction of *
, I have been writing VCF like:
chr1 10 GTATATA G 0|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
My impression is most tools use the same way? Are you suggesting replacing 0|1
to .|1
at pos 10? That is not technically right, either, because the first allele can be "determined". If this is not your suggestion, how do you write VCF for my example?
EDIT: hmm... Rereading @kevinjacobs-progenity's comment, I realize that he only pointed out the problems with both 0|1
and .|1
but didn't give a solution.
@kevinjacobs-progenity I agree with most of this. I ended up using .
as you describe only because the specification doesn't give me another option (if I want a consistent VCF). @lh3 Using 0
in this way is not the approach that I assumed, since the specification does not define it this way. Why then have *
at all if we're going to allow reference overlaps with 0
in some places?
Regarding 'multi-allelic'/MNV calls. Scalability is only half the reason; more fundamentally, I think that variant calls should be interpreted as mutation event calls. So If I report chr1 15 TAGCGA TCGCTA, TAGTGA 1|2
then I make the claim that a single mutational event was TAGCGA > TAGTGA
. Of course, in most cases, we do not know and cannot confirm which mutational event actually occurred, but we can still make an informed guess and add priors. If you don't care about such details and just want resulting haplotypes then that information is preserved with phase information.
@lh3 There's actually no need to 'look ahead' - you can just keep a record of which reference bases are yet to be determined and fill them in when you see the next record. I agree it's generally a good principle to reduce redundancy, but I think in this case it's worth the relatively small cost in order to avoid inconsistency and potentially reduce parsing errors.
At pos 10, neither .|1
nor 0|1
is technically correct. I more like 0|1
and believe most existing tools are using it, but I can live with .|1
. What is more important is a consensus to reduce the multiple ways to encode the same set of variants.
To close the loop: at one point, I was thinking to use something like the following at pos 10 (note that this is slightly different from the example above)
chr1 10 GTATATA G,GTATA 2|1
The problem is we normally assume the 2bp del on hap1 is placed at pos 11 (this is another problem with multi-allelic VCF), but here we want to put the 2bp del at pos 15, so we would need a CIGAR field to distinguish the two cases. The price to achieve technical correctness is too high. I wouldn't go this way. Base *
can also achieve technical correctness, but its price is even higher.
I went back to bgt and noticed that bgt converts my example to the following VCF:
chr1 10 GTATATA G,<M> 2|1
chr1 15 TAGCGA T,<M> 1|2
chr1 18 C CT,<M> 2|1
Here <M>
is a symbolic allele, standing for "miscellaneous". It plays the same role as *
for deletions, but can also refer to other types of alleles. If we redefine *
to be a "miscellaneous allele overlapping with the current record", we will get @jmmut's VCF:
chr1 10 GTATATA G,* 2|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
which would be technically correct under the new *
definition. With the current spec, though, the above is invalid. I am not sure if anyone is using this syntax in practice.
Everyone is complaining about this, but no one finds a satisfactory way to eliminate overlapping records. The best attempt is freebayes' CIGAR, but that is close to being useless in the presence of long deletions.
Coming from the SV world, I don't really appreciate why overlapping records are so problematic for indels. Is it because downstream analysis pipeline break on unphased overlapping records, because left-alignment of phased records results in an self-contradictory haplotype, because caller don't reliably report phasing blocks, because VCF phasing doesn't work in the presence of aneuploidy (this is a huge issue for somatic SVs), because the non-uniqueness of variant and haplotype representation makes cohort analysis and variant comparison too complex, because the 'optimal' representation (e.g. 2 SNVs vs 1 MNV call) is different for different downsteam analyses, or something else?
My naive assumption has been that SNV/indel callers determing haplotypes, and reporting the left-aligned set of variants corresponding to the set of SNVs and indels in the haplotype-reference alignment would give a representation without any of the above issues. Why doesn't this work?
@d-cameron It's not really indels specifically. There are three issues in my mind:
ACT > GCC
then I assume a single mutation event resulted in this MNV, but arguably the simpler explanation is two separate SNV mutations A > G
+ T > C
. On the other hand, if I need to explain the haplotype ATCGAA > TTCGAT
then probably the simplest explanation is a single micro-inversion event. The key point here is that we can argue about these differences in representation with some theoretical underpinning, and are in principle verifiable.chr1:100-201
, but also call an SNV at chr1:150
, do you really want me to report the SNV as a 100nt string? Sure you can add a CIGAR field, but what help is that really? What unnecessary allele lengths are we willing to tolerate in order to have multi-allelic calls? This is obviously subjective, and will result in different representations - without any theoretically justifiable argument. This makes annotating calls more difficult and less consistent.FORMAT/GT
) to refer to the REF
or ALT
alleles, or indicate the allele was 'missing' (.
). So let's suppose we call a heterozygous deletion overlapping with a SNV, and the reference elsewhere. How do we report this in VCF, like this?chr1 10 GTATATA G 0|1
chr1 15 T G 1|0
0
for the first haplotype while the second asserts an SNV. Likewise the first record asserts a deletion for the second haplotype while the second record asserts the reference. Another option is to replace the 0
s with .
s, but this doesn't accurately describe what we actually called. Later versions of VCF (4.3 I believe), added the *
symbol in an attempt to resolve this situation, but the definition was ambiguous and insufficient, leading to the discussion here.I see three viable solutions:
*
is symbolic. Extend the definition to specify overlaps with any allele in either direction. Make clear that *
implies the reference for non-overlapping bases. This will result in calls like:chr1 10 GTATATA G,* 2|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
*
indicates that this reference base is specified by an overlapping record. This will result in calls like:chr1 10 GTATATA G,GTATA** 2|1
chr1 15 TAGCGA T,**G*GA 1|2
chr1 18 C CT,* 2|1
*
to be replaced with a single *
, resulting in identical output as symbolic *
for fully overlapped alleles. The difference in parsing is trivial.*
to refer to overlaps in both directions means that it's not possible in general to reconstruct haplotypes without position information from the previous records. With distinct characters this wouldn't be required. Resulting output is like:chr1 10 GTATATA G,GTATA## 2|1
chr1 15 TAGCGA T,**G#GA 1|2
chr1 18 C CT,* 2|1
Note that the alignments for the examples are the same as first given by @lh3 here:
Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1: XXXXXXXXXGTATAT------XXXXX
H2: XXXXXXXXXG------GCTGAXXXXX
My preference is option 2 or 3, with the *
-only allele replacement. This results in the same output as option 1 most of the time, but is more explicit in the relatively rare case of partial overlaps - which I consider a positive. It's worth noting that the type of information provided by options 2 or 3 is what a VCF parser would need to figure out for option 1 anyway. So I don't think the argument that there are more ways to create invalid VCF's is a good one because by the same logic you can argue there are more ways to incorrectly parse the VCF in the case of option 1.
@dancooke, can you edit your last comment to show the underlying alignments so that people can comment on your proposal?
In general, this problem is too complex and I don't believe it can be solved within VCF. This is in essence an attempt to represent a graph of arbitrary complexity.
@pd3 done. I think it depends what you mean by "too complex" and "solved". If you mean that we can have a consistent VCF that accurately describes inferred mutation calls and haplotypes then I do believe the proposals I suggest offer solutions. Unless you can find a counter-example? Presumably the same problems exist for graph-based methods since essentially these just changes the reference in a sample specific manner? Then there's the additional problems of computing and applying population annotations without a common reference.
There are other solutions.
Redefine *
to be the ensemble of alleles overlapping with the current record. Pros: theoretically clean. Cons: may choke some existing parsers. This would be the top choice in the early days of VCF.
chr1 10 GTATATA G,* 2|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
Redefine 0
to be reference allele until the next record. Pros: least disruption to existing tools. This has been implicitly used. VCF parsers can all deal with this to different degrees. Cons: not theoretically clean. If we don't make a decision, we are effectively adopting this redefinition. A well-thought VCF parser needs to support this anyway for compatibility.
chr1 10 GTATATA G 0|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
Redefine .
to be a missing allele or an unspecified allele. Cons: not clean (one symbol, two meanings); not compatible (some parsers may take half genotypes as missing). I don't like it, but I don't strongly object to it.
chr1 10 GTATATA G .|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
Use the CIGAR field. Pros: no change to the spec. Cons: complex (few tools produce such VCF); misinterpreted by parsers not dealing with CIGAR. I don't like it, but I don't strongly object to it.
chr1 10 GTATATA G,GTATAT CIGAR=1M6D,6M1D 2|1
chr1 15 TAGCGA T,* .. 1|2
chr1 18 C CT,* .. 2|1
Base *
is out of question.
In general, this problem is too complex and I don't believe it can be solved within VCF.
I agree. My attempt here is to clarify and possibly to alleviate the issue without breaking the existing tools.
@lh3 Your solution 1 is the same as my solution 1 no? 2 is not a good solution as you have to redefine GT
. It also results in the weird situation where you call the reference after the next allele but actually this doesn't technically count because by the definition the next allele stops the reference allele. 3 is better than 2 but worse than 1 since it stretches the meaning of .
. 4 I strongly object to for reasons given in detail already - this would seriously limit the practicality and expressiveness of VCF.
Base * is out of question.
That's not for you to decide alone. You keep dismissing base *
out of hand as too complex but don't elaborate. I don't believe you even considered any of the detailed arguments I made. As I said, the 'complexity' you're presumably referring to is what any parser would need to work out anyway since the complexity is inherent in the problem. Having the VCF generator do this is preferable in my opinion as the information should already be available (having called the haplotypes), and it reduces the likelihood of parsing errors. Since this is undoubtably a complex problem, it deserves an explicit and consistent grammar with built-in sanity checks (via a small amount of redundancy). Only base *
provides that.
My attempt here is to clarify and possibly to alleviate the issue without breaking the existing tools.
Pretty much any solution we decide on will break some tools because most tools currently use different syntax. Support for upstream base *
was added to RTG Tools vcfeval without much issue (as I understand), so I don't think this is as painful as you make out.
The issue with the reference allele not actually meaning strict equality with the reference has been the defacto standard for ever and is evident in interaction between a deletion and a SNP:
POS REF ALT FORMAT s1 s2
10 GTATA G GT 0/1 0/0
12 A C,* GT 1/2 0/1
The 0/0 GT is technically "wrong" since the haplotype has a SNP in it (as evident in POS 12), so the haplotype is not REF.
I support the restriction of the ALT strings so that *
can only appear on its own (as, I think, was the original intention) and also the restriction of GT 0 to mean
"Agrees with REF except for cases where that disagreement should be represented in a different POS."
@yfarjoun According to that definition you don't even need *
at all. Also your example doesn't make sense. For a start, the genotypes aren't phased, in addition, the genotype at POS 12
for sample s2
should be 0/0
not 0/2
since there are no interacting deletions for s2
. I think you meant that 0/1
is technically wrong. This is kind of my point of why we need technical correctness. If we can't even get these simple examples right then something is seriously wrong...
thanks for spotting the typo @dancooke.
but no I really mean that 0/0 is technically wrong since strictly speaking it means that s2 has the reference allele (all 5 bases of it) on both haplotypes. and it doesn't.
I think that the phasing element is irrelevant, the interpretation of the alleles is the same.
@yfarjoun is a core GATK developer. He confirms my speculation that the overwhelming majority of VCFs, knowingly or not, have adopted:
chr1 10 GTATATA G 0|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
All these VCFs are technically wrong according to the current spec. However, if we redefine 0
in the spec, these VCFs will become technically correct and most existing tools will continue to work without modifications. It is a complete and easily the cheapest solution.
@dancooke, thank you for raising the issue, which made us (me at least) clear that the spec has flaws. Meanwhile, you need to stand from a maintainer's point of view. It is close to impossible to flip the VCF ecosystem just for octopus, no matter how terrific it is.
you need to stand from a maintainer's point of view
I really don't want to inflame this conversation, can the maintainers weigh in? @lbergelson, @pd3 & @jmmut
@yfarjoun Ok well now you've changed the example it makes more sense. This type of example does demonstrate another reason why changing the meaning of 0
in GT
is a bad idea - it results in misleading and inconsistent annotations:
POS REF ALT INFO FORMAT s1 s2
10 GTATA G GT AC=1;AN=4 0/1 0/0
12 A C,* GT AC=2,1;AN=4 1/2 0/1
So the first record suggests that the frequency of the reference allele is 0.75 but the second just 0.25 - but they describe the same haplotype. The results for each of my solutions:
POS REF ALT INFO FORMAT s1 s2
10 GTATA G,* AC=1,2;AN=4 GT 2|1 0|2
12 A C,* AC=2,1;AN=4 GT 1|2 0|1
POS REF ALT INFO FORMAT s1 s2
10 GTATA G,GT*TA AC=1,2;AN=4 GT 2|1 0|2
12 A C,* AC=2,1;AN=4 GT 1|2 0|1
POS REF ALT INFO FORMAT s1 s2
10 GTATA G,GT#TA AC=1,2;AN=4 GT 2|1 0|2
12 A C,* AC=2,1;AN=4 GT 1|2 0|1
In each case, all is well in the world. This type error will almost surely mislead analysis and waste time. I understand there's an initial cost associated with changing the current "ecosystem", but surly, now that the problem is understood and a solution proposed, is the time to fix it and potentially save many hours of otherwise future wasted efforts?
@lh3 I think it's a little disingenuous to suggest I'm arguing this simply for my own benefit. I didn't even raise this issue. I believe the changes I'm suggesting will serve the community - which is not isolated to the GATK ecosystem - best in the long term.
0.75 is indeed the reference frequency up to the next overlapping record. If we redefine 0
, everything will be consistent. Also, we most often care about the frequency of the 1
allele, which is not affected either way.
Meanwhile, this observation rejects your ./1
genotype as AN
will be wrong. Also, such multi-sample examples provide an argument against base *
. Think what you will do when there is a 100kb deletion in a large population.
@lh3 Ok but let's be honest, people do not process VCF that way - the vast majority of users won't be aware of such quirks of the specification; they will take the information provided in individual record annotations at face value, and won't consider overlaps at all (which I believe is the point @d-cameron was making). That's why it's so important to have it made explicit.
POS REF ALT INFO FORMAT s1
10 GTATA G GT 0|1
12 A C,* GT 1|2
In the example above, is the reference on the first haplotype for 13-14
called or not? According to the 'hack' 0
definition it technically isn't since the record at 12
'blocks' the previous record. So if I want to state those bases are called reference, I have to resort to adding additional reference call records covering those bases. Meaning that the base *
output:
POS REF ALT INFO FORMAT s1
10 GTATA G,GT*TA GT 2|1
12 A C,* 1|2
Is semantically equivalent to the hack 0
output:
POS REF ALT INFO FORMAT s1
10 GTATA G GT 0|1
12 A C,* GT 1|2
13 TA <*>,* GT 0|2
rather than the first example shown. This is simply unacceptable - and really does introduce completely unnecessary redundancy. This will also be a headache for code generating VCF. Then there's also the problem of whether the next record is filtered, does that still 'block' the previous record or not?
I never claimed ./1
was the correct solution. If there's a 100kb deletion in a large population then the output of base *
will be almost identical to the well-defined symbolic *
definition, only the deletion itself - and any other partial overlapping alleles - will contain additional ALT alleles.
IMO the question is within clarifying 0
as "reference until next record", or changing *
from "overlapping deletion" to "ovelapping allele", which correspond respectively to option 2 (of @lh3 here) and option 1 (of @dancooke here and here and of @lh3 here)
As much as I dislike it, I think the best pragmatic way to address this is to clarify 0
as "reference until next record", at the very least for 4.3 and previous versions, given that it's the de facto standard. I think the worst we could do is to allow different VCFs in the same version to mean different things.
Maybe it's not wise to change this even for future versions of VCF. Usually "several standards" is worse than "a single mediocre standard, with its flaws clearly stated" because it is more confusing to users, requires more knowledge about format history and requires more support from tools.
However, for the next VCF version, I would probably favour changing *
from "overlapping deletion" to "ovelapping allele", to be able to restrict the meaning of 0
to "true reference", basically replacing just the usage of those 0
-but-not-really to *
. I think this would allow correct representations to most of @dancooke concerns, while being familiar to users and tools. This option still has the problem of 0
meaning different things across versions, though.
I totally agree that a for new formats, the base-*
is an interesting take. It's less ambiguous and uses record interdependence in a well-defined way. It looks actually still correct when filtering out some variants, as some analysis do. However, on top of needing parsers support, I still see it as more complex. Another corner case example is that there are other issues open on how to merge VCFs, and adding more samples to a VCF with new variants that overlap deletions would require updating those deletions too, adding complexity due to the interdependence.
So moving to concrete actions, I humbly think what we should do for the current version at least is:
I'm sorry there's no solution here that pleases everyone.
@jmmut If we're really going to go with the re-define 0
option then the re-definition needs to be more like:
GT=0 means REF allele unless specified otherwise by downstream records.
Otherwise you have the problem I described above. Essentially then 0
adopts the meaning of a symbolic *
but for downstream records (in addition to meaning REF
).
I think it's also worth pointing out that - as far as I'm aware - the only two open source variant callers that use *
at all are GATK4 and Octopus. Symbolic *
support was added to GATK4 in version 4.0.9.0 (released on 20 September 2018), while Octopus has supported base *
since its release in October 2016. I therefore don't think it's accurate to say that using *
and 0
in the GATK4 (v4.0.9.0+) way is "standard". Probably the most widely used interpretation of 0
is actually REF unless specified otherwise by any other record
, since this is what every caller other than GATK4 (v4.0.9.0+) and Octopus use (to the best of my knowledge). This would technically need to be the re-definition of 0
for pre-v4.2 VCF (without *
), and could be used for v4.2-3 to achieve consistency across versions (although then *
in v4.2-3 becomes redundant). Furthermore, notable callers that don't use *
still use VCF v4.2 (e.g. DeepVariant), so the definition of 0
would need to be the same as v4.1 unless their calls are to break.
I think a fair compromise would be to re-define 0
for VCF up to and including v4.3, but the next version should use either the well defined symbolic *
or base *
. I think it would be reasonable to request that this new version of VCF is made available promptly (ideally alongside 4.3). This way, nothing breaks for the majority of tools that don't use *
, GATK can maintain 4.3 and update to the newer version at their own pace if desired, while Octopus and other tools wishing to adopt *
can take advantage of the newer version without having to conform to GATK's choices.
I can confirm that the second solution from @lh3's list is in my experience the most common interpretation of complex events that span multiple records.
The REF and ALT columns can be viewed as operations that need to be applied in order to obtain the actual haplotype sequence. In a set of overlapping rows the 0
allele is considered as "not changing the REF allele at this record", but should not be interpreted rigidly as the reference allele if other conflicting records follow (or precede).
Note that this interpretation has been in use for various purposes for a long time, e.g. when breaking multiallelic sites into biallelic sites in order to make them digestable by imputation programs.
I agree that the specification needs a clarification to prevent future confusion. I support @lh3's proposal, only would remove the requirement of ordered records as I think stating "overlapping" is more general and there have been strong concerns (now and in past) about the order requirement.
@pd3 Then *
should be removed from the specification as it would be made redundant:
chr1 10 GTATATA G 0|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT,* 2|1
chr1 10 GTATATA G 0|1
chr1 15 TAGCGA T 1|0
chr1 18 C CT,* 2|1
chr1 10 GTATATA G 0|1
chr1 15 TAGCGA T,* 1|2
chr1 18 C CT 0|1
chr1 10 GTATATA G 0|1
chr1 15 TAGCGA T 1|0
chr1 18 C CT 0|1
Would all describe the same set of haplotypes.
No, I don't think *
should be removed from the specification. The proposal is:
1) no new *
base
2) the *
allele remains to indicate overlapping alleles from different records
3) the interpretation of 0
suggested as above ("no REF change at this record"). In this meaning it is used only as a default fallback to resolve possible conflicts and ambiguities
@pd3 Are you suggesting that under your proposed re-definition of 0
and *
that the these examples are not semantically equivalent? If so I am truly lost.
the only two open source variant callers that use
*
at all are GATK4 and Octopus
Htsbox has been using *
since 2015 or 2016. This actually doesn't matter. Even before *
was introduced, 0
had never been the REF haplotype in fact.
I would probably favour changing
*
from "overlapping deletion" to "ovelapping allele", to be able to restrict the meaning of 0 to "true reference", basically replacing just the usage of those 0-but-not-really to*
.
This changes the semantics of VCF. A record in today's VCF effectively describes the interval maximally between this and the next record. With your change, a record will describe the interval of the current record. This is a subtle but important difference. Your proposal sounds better in theory, but will have problems in practice. Suppose we have a 100kb deletion in a population. If 0
only means the reference haplotype, we will be unable to determine the number of 0
s with short reads because we won't be able to phase through this 100kb region in diploid samples without this deletion. Today's VCF doesn't have this issue because it doesn't require haplotype information or phasing.
Earlier I said "[redefining *
] would be the top choice in the early days of VCF". Now I am in favor of the current practice, which is the one you and @pd3 have described.
@pd3
the * allele remains to indicate overlapping alleles from different records
the spec says "overlapping deletion", did you mean change to "overlapping allele" or remain as overlapping deletion?
@jmmut Yes, I meant "overlapping allele". Can't think of any harm it might cause, happy to be proven wrong though.
@pd3 Then I'm lost too because this is the very issue that we were trying to avoid by staying with the "0 means not changing the REF allele at this record". This sounds to me doing both options I described, and having the disadvantages of both. (Do the rest think it's ok to change the spec to both "0 means not changing the REF allele at this record" and *
to mean overlapping allele? this would mean that you can not state a "true reference" even if you use *
)
@lh3 I'm sorry but I don't understand what problem this scenario represents. Could you elaborate, please? Would this still apply if we did both changes?
Suppose we have a 100kb deletion in a population. If 0 only means the reference haplotype, we will be unable to determine the number of 0s with short reads because we won't be able to phase through this 100kb region in diploid samples without this deletion.
Also, with the "0 means not changing the REF allele at this record" interpretation, I don't understand (just as @dancooke) if these are semantically equivalent to @pd3 or not, and if yes, why *
was introduced in the first place. If they are not equivalent, then I don't know which one is the correct one nor why.
The VCF spec contains this text on ALTs:
I think probably the intention here was to allow either
*
or[ACGTN]+
, but not a mixture of the two. But as written it theoretically allows alleles like*A*C*G*T*
, but has nothing to say on how such alleles should be interpreted.I've recently come across VCFs generated by Octopus that contain alleles that start with a
*
and end with base sequence, e.g.*AC
. This is technically valid VCF, but I suspect most tooling won't support it (GATK tools fail on it). In their case they seem to be using it to indicate that the anchor base in a deletion record is covered by an upstream deletion, but not the whole allele.I think the spec should either be updated to clarify this kind of usage or if the horse is already out of the gate, prohibit it.