VCF spec allows ALTs that are mixes of * and bases, but doesn't define how to interpret them

tfenne commented 5 years ago

The VCF spec contains this text on ALTs:

Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (“\<ID>”) or a breakend replacement string as described in the section on breakends. The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion

I think probably the intention here was to allow either * or [ACGTN]+, but not a mixture of the two. But as written it theoretically allows alleles like *A*C*G*T*, but has nothing to say on how such alleles should be interpreted.

I've recently come across VCFs generated by Octopus that contain alleles that start with a * and end with base sequence, e.g. *AC. This is technically valid VCF, but I suspect most tooling won't support it (GATK tools fail on it). In their case they seem to be using it to indicate that the anchor base in a deletion record is covered by an upstream deletion, but not the whole allele.

I think the spec should either be updated to clarify this kind of usage or if the horse is already out of the gate, prohibit it.

pd3 commented 5 years ago

Yes, the intention was to allow either * or [ACGTN]+ and the combination should be prohibited.

tfenne commented 5 years ago

Some related discussion here that I missed previously: https://github.com/samtools/hts-specs/issues/151 And an issue/question in Octopus which uses * followed by bases to represent partially spanned alleles: https://github.com/luntergroup/octopus/issues/75

tfenne commented 5 years ago

@lbergelson, @pd3 & @jmmut pinging you guys since you are the currently listed VCF maintainers. Is there a path to making decision here in the near future? It's becoming a little awkward since Octopus uses alleles of \**[ACGT]+ and currently HTSJDK cannot read VCFs containing those alleles, which makes it a) impossible to use Picard of GATK tools with Octopus VCFs and b) near-impossible to write custom code that uses HTSJDK to process Octopus VCFs (see https://github.com/samtools/htsjdk/pull/1413).

My sense is that since this isn't explicitly dis-allowed by the spec and that @dancooke has shown a good use case for alleles that start with * and continue with bases, and that it's never been disallowed by the specification, we should probably just formalize that it's ok and give examples of how it can be used to represent partially-spanned alleles. But the worst case is to have uses like this in the wild and reference implementations that don't want to support that syntax even though it's not in violation of the spec.

If folks are on board with formalizing the usage from octopus I'm happy to put together a PR to the spec. And if that's a non-starter could someone else make a PR to clarify the restrictions on use of * please?

lbergelson commented 5 years ago

@tfenne I'm not totally sold on this. It's not explicitly disallowed by the spec, but it was definitely not intended.

I see how it's useful for figuring out a partially spanned allele, but it seems like it could add a lot of complication to an already error prone and poorly supported feature.

Is the intent to support alleles starting with * or do you want to allow * at any position in the read? We could potentially have very complicated alleles overlapping multiple things specified at other points in the vcf. It seems like supporting this will open up another set of equivalent ways to represent the same haplotypes in a vcf.

dancooke commented 5 years ago

@lbergelson I would argue that the 'symbolic' interpretation adds even more complexity as it adds a completely distinct concept. The base-*interpretation simply extends the set of allowed ALT bases by one, essentially meaning "already reported upstream".

I think the definition I gave implicitly requires any * to be a the head of the allele - otherwise the allele that is causing the non-head-* would be reported in a downstream record, contradicting the definition.

tfenne commented 5 years ago

@lbergelson What I really want is for the spec to be 100% clear on what is and isn't supported, and for HTSJDK to be in line with that, and for Octopus to generate VCFs that are spec compliant. I tend to think clarifying the spec to allow one or more *s followed by 0 or more bases is a good approach as it doesn't significantly complicate what's there now, and it doesn't penalize @dancooke for implementing a feature in a spec-compliant (albeit underspecified) way. But mostly I'm just looking for alignment between the spec, Octopus and HTSJDK.

lh3 commented 5 years ago

I prefer to disallow an ALT like **AAC. Due to the way VCF is designed, we often need to look at the previous records to make the right decision at the current record. Having **AAC doesn't solve this problem and further introduces additional conceptual and technical complexity, and more ways to generate inconsistent VCFs.

dancooke commented 5 years ago

There is actually a more fatal problem with the symbolic *. Considering again the example given by @tfenne:

# Symbolic * representation
chr1  1000  GTATATA G         ... 0|1
chr1  1005  TAGCGA  T,*       ... 1|2

# Base * representation
chr1  1000  GTATATA G         ... 0|1
chr1  1005  TAGCGA  T,**GCGA  ... 1|2

Both of these examples are actually inconsistent as the first record states the reference (A) on the first haplotype at chr1:1006 while the second states that this base is deleted. Octopus currently deals with this by using .:

chr1  1000  GTATATA G         ... .|1
chr1  1005  TAGCGA  T,**GCGA  ... 1|2

But this isn't particularly satisfactory as it doesn't make a strong claim about the reference between chr1:1000-1004. I don't see a way around this using the symbolic representation. However, it's fairly simple to extend the definition of base-* to something like:

This base is stated by an overlapping record.

Then the output would be

chr1  1000  GTATATA G,GTATA**         ... 2|1
chr1  1005  TAGCGA  T,**GCGA  ... 1|2

Which is consistent and makes it simple to reconstruct haplotypes as there no need to remember position information between overlapping records.

lh3 commented 5 years ago

The haplotype alignment is:

Pos: 1234567890123456789012345
Ref: XXXXXXXXXGTATATAGCGAXXXXX
H1:  XXXXXXXXXGTATAT-----XXXXX
H2:  XXXXXXXXXG------GCGAXXXXX

Your final VCF is wrong. The first line should be:

chr1  1000  GTATATA G,GTATAT*         ... 2|1

A VCF parser can only detect this error when it comes to the second line. To validate a VCF, you always need to look at multiple lines. Your proposal is not making things simpler. Furthermore, your proposal allows inconsistencies between the two lines. Wrong second lines could look like:

chr1  1005  TAGCGA  T,**GCA  ... 1|2

or

chr1  1005  TAGCGA  T,****CA  ... 1|2

A proper design should try to minimize such errors.

The right way to represent this alignment is:

chr1  1000  GTATATA G,GTATAT  2|1
chr1  1005  TAGCGA  T,*       1|2

It is easy enough to reconstruct the alignment. There are fewer ways to create wrong VCFs.

dancooke commented 5 years ago

@lh3

The first line should be:

chr1 1000 GTATATA G,GTATAT* ... 2|1

Nope. The T before the * is (and must be) determined by the next record.

The right way to represent this alignment is:

chr1 1000 GTATATA G,GTATAT 2|1 chr1 1005 TAGCGA T,* 1|2

Completely disagree. You're permitting duplicate information in one direction but not the other. Not only is this inconsistent, it's also very confusing - It looks like the first record is calling a 1bp deletion, but you don't know that this isn't the case until you see the next record, and the first gives no indication you must do this (unlike the way I proposed).

There are fewer ways to create wrong VCFs.

I think this is a poor argument. For a start, the number of tools generating VCFs de novo is small compared with the number that parse/manipulate them, and I'd strongly argue that the base * representation makes parsing VCFs less error prone than symbolic * (or your proposed "right way") as * has clear meaning - it essentially says "ignore this base". Moreover, if you're generating inconsistent VCFs with this representation then probably you have some deeper problem with your algorithms internals. I'd rather have an output format that forces me to address this rather than allowing me to sweep it under the rug.

dancooke commented 5 years ago

Would also be good to hear @Lenbok's thoughts on this - he added support for base * to RTG Tools vcfeval.

lh3 commented 5 years ago

Nope. The T before the * is (and must be) determined by the next record.

Then we are interpreting your * differently (I see your * as padding). This complicates an already complex feature.

It is worth mentioning the freebayes way to encode this VCF. See a more complex example:

Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1:  XXXXXXXXXGTATAT------XXXXX
H2:  XXXXXXXXXG------GCTGAXXXXX

The freebayes way is to use CIGAR for a series of interlocking variants:

chr1  10  GTATATAGCGA GTATAT,GGCTGA  CIGAR=6M5D,1M6D2M1I2M

This is the best way to describe both haplotypes at the same time. On this example, your attempt to describe both haplotypes will become more complex. I guess there should be ways with your method, but I can't think of one for now. With the current *, we can describe one allele on one haplotype:

chr1  10  GTATATA G,GTATAT  2|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

PS: the major problem with the freebayes method is that it becomes awkward when there are a 6kb deletion on one haplotype and multiple small variants on the other haplotype. How do you deal with that in your representation?

The more I think of this problem, the stronger I feel we should forbid **AC.

jmmut commented 5 years ago

Another issue to take into account is that the current spec doesn't mention that the record where the overlapping deletion is described needs to appear before the current record, so would this be a legal VCF?

chr1  10  GTATATA G,*  2|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

This has less redundancy and looks more straight to the point, but of course makes haplotype reconstruction harder by having to look at later records.

As long as the spec is clear and easy to understand I'm ok with any approach. At the very least I think we should address the current wordings:

Ambiguity on the "Strings made up of the bases A,C,G,T,N,*" part.
Ambiguity on whether the overlapped allele has to be described before or can appear later.

If we instead adopted the partial spanning notation, it should make clear the meaning of *. From this (to me, surprising) sentence:

The T before the * is (and must be) determined by the next record.

What I understand is that the purpose of each * is there to avoid redundancy, i.e. avoid putting a base that is described in other records, which is yet another different meaning of the already too overloaded *. Maybe I'm wrong, but I don't see any fatal issue with the current notation that the partial spanning notation solves. The partial spanning one may be clearer on a few examples, but the added complexity makes it likely there will be more confusion about it in most cases.

I'm slightly in favour of just fixing the wording of the current simple approach.

dancooke commented 5 years ago

The definition I'm suggesting is:

`*` indicates that this reference base is specified by an overlapping record

Note that this does't only apply to deletions, or upstream records (as currently required), in addition, there is no requirement to introduce a new type of 'symbolic' allele.

Using this definition, your example would be reported as:

chr1  10  GTATATA G,GTATA**  2|1
chr1  15  TAGCGA  T,**G*GA  1|2
chr1  18  C  CT,*  2|1

This has several advantages over the symbolic representation you show:

It's clear that the first record is not calling a 1bp deletion on the second haplotype.
The reference is explicitly being called (it's not clear that this is the case with the current symbolic definition).
We know exactly which other positions we need to look at to complete the haplotypes.
It doesn't use one set of rules for upstream events and another for downstream events.
There's no duplicate information.

Just to stress that last point, let's look how you would represent your example without the first deletion:

chr1  15  TAGCGA  T, TAGCTGA       1|2
chr1  18  C       CT,*      2|1

So not only do you end up completely changing representation (going from symbolic * to an explicit allele), you also end up duplicating the insertion and therefore using the FreeBayes representation anyway. Using my definition this becomes:

chr1  15  TAGCGA  T, TAG*GA       1|2
chr1  18  C       CT,*      2|1

Consistent with the full example. No duplicate information - and therefore less error prone. I really can't understand how you find this more complex or confusing?

You already point out the main reason that the FreeBayes representation isn't viable - it doesn't scale. It's also hard to read a complex cigar string, and some complex substitutions can't be properly represented with a cigar (e.g. micro-inversions).

I'd also suggest that it makes matching and annotating variants from variation databases harder since simple variation gets buried into long haplotypes. Consider one more example, a complex substitution and an SNV. The 'FreeBayes' way would be:

chr1  15  TAGCGA  TCGCTA, TAGTGA       1|2

There's no way to 'decompose' this (without using .) using the existing specification since * only currently applies to deletions. It makes it far from obvious there's a SNV even in this small case. Using my suggested definition, this becomes:

chr1  15  TAGCGA  TCGCTA, TAG*GA       1|2
chr1  18  C       T,*      2|1

Now the SNV is clear, and can easily be annotated.

lh3 commented 5 years ago

@jmmut I was thinking about this on a flight. Now for my example:

Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1:  XXXXXXXXXGTATAT------XXXXX
H2:  XXXXXXXXXG------GCTGAXXXXX

my preference is:

chr1  10  GTATATA G         0|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

Here 0 means the reference allele until the next record. If I am right, this is how we are actually using 0 in practice. @tfenne's original example follows the same logic. This representation describes one allele at a time. It is succinct and consistent.

[the current * representation] makes haplotype reconstruction harder by having to look at later records.

When there are overlapping variations, you anyway need to read multiple records to reconstruct the haplotypes. The key information base * tells us is that there are overlapping records later on. You can trivially know this by reading the following records.

dancooke commented 5 years ago

@lh3 I really don't like this. It's weird to have one rule for one direction and a completely different one for the other. You're suggesting entirely changing the definition of GT and using a symbolic allele (*), but I can still never be sure if a 0 in GT actually means reference just from looking at the record - this will result in confusion and mistakes. It's not even consistent, my 'complex substitution' example would become

chr1  15  TAGCGA  TCGCTA       1|0
chr1  18  C       T      0|1

and we'd be right back to the situation without * in the first place.

@jmmut This is at least consistent, but the advantages of base * over symbolic * remain:

I can easily see which bases are called reference; symbolic * must implicitly assume the reference (although this isn't clear from the specification) and forces me to look at other records and calculate overlaps.
Haplotype construction is simpler and less error prone. It also helps spot bugs in code generating VCFs as sanity checking VCFs for consistency is simple.
Genotype zygosity is better preserved. For example, with symbolic * we easily end up having genotypes like chr1 18 C T,* 1|2|2, making it look like the zygosity is 2, but where the two * symbolic alleles actually refer to different overlapping alleles. This is unlikely to happen with base * (we'd usually have 1|2|3). The same problem for diploid could occur if joint calling multiple samples (e.g. chr1 18 C T,* 2|2 with * referring to different alleles).

The only downside (that I can think of) is that ALT alleles are more verbose than with symbolic *. I believe this is a price worth paying for the advantages. Note that is always possible to convert records from base * to symbolic *, but not the other way around, showing that you do loose information.

lh3 commented 5 years ago

I believe what I described is just an elaboration of the current approach. It is simple to generate: describe each atomic allele individually. It is also simple to parse: reconstruct the haplotype alignment based on the current record, discard the part of the haplotype beyond the the next record and write a new allele. It is imperfect, but it gets the job done and is compatible with existing ecosystem.

Your complex substitution example is actually more about multi-allelic sites. For example:

chr1  10  A  C,T  1|2

In bgt, I have long been using

chr1  10  A  C,<M>  1|2
chr1  10  A  T,<M>  2|1

Symbolic allele <M> means a different allele. For deletions, it plays the same role as *. If we like (not that I am proposing!), we can extend the definition of * to substitutions. Then * would be the same as <M> in bgt.

On three haplotypes, we can use

chr1  15  A  C,*,*  1|2|3

to indicate genotypes involving two overlapping deletions, if we really want. In practice, it doesn't matter either way. The parser can know the correct haplotypes based on the upstream records.

Several people in this thread thought base * helps to reduce dependencies between records. Not quite. First, a competent VCF parser should validate across records and have to look at down/upstream anyway. Second, see this base * example:

chr1  10  GTATATA G,GTATA**  2|1
chr1  15  TAGCGA  T,**G*GA   1|2
chr1  18  C       CT,*       2|1

At position 15, the parser needs pos 18 to construct the haplotype alignment, and when it comes to pos 18, it needs to memorize the record at pos 15. The is the result of the insertion. Furthermore, the redundancy across records is particularly concerning as it gives us more ways to create wrong records. I designed sam, bam, gfa and bcfv2. Redundancy like this is one of the first things I would strive to remove.

d-cameron commented 5 years ago

I strongly oppose any format (re)definition that requires any information that relies in any way on record ordering or for which the variant call for a single record cannot be determined by reading that record in isolation. Common operations that will break such a definition are:

filtering a VCF (!)
left/right/centre aligning calls
sorting a VCF (due to weakly defined order)
taking a positional subset of a VCF (e.g. subset to exonic variants)

There is definitely room for improvement in defining phasing information but breaking steps that are used in almost every downsteam processing pipeline is something I am strongly opposed to.

Somewhat of a pity we didn't get around to fixing this given it was raised this as a problem 5 years ago: https://github.com/samtools/hts-specs/issues/28

kevinjacobs-progenity commented 5 years ago

I agree with @d-cameron.

It seems that this discussion is about redefining VCF to incorporate a "graph style" representation of alleles and support breaking large haplotypes into simpler nested loci. If that is the intent, then let's carefully define the semantics AND document them in the specification. e.g. allowing overlapping records in VCF has been a nightmare for tool developers because everyone basically picks the semantics that makes the most sense to them at the time and assumes everyone else in the world should think about it the same way.

If we do wish to expand VCF semantics, then let's consider that most attempts at doing so have redefine either

'.' in GT to mean "no allele specified" as opposed to the more traditional interpretation of "no allele called"
'0' in GT to mean "maybe reference unless another record suggests otherwise" as opposed to the traditional interpretation of "this allele is reference"

Both of those redefinitions are incompatible with other interpretations of the standard. e.g. why can't '0' actually mean reference? e.g. GT of '1/.' has been used to indicate a "half-call" where ALT 1 is present at a diploid locus and a second allele cannot be determined. This is distinct from using the non-ref allele <*> because the uncalled allele may in fact be reference (just at an insufficient quality level to call).

@dancooke : Just as a data point, I understand that you consider the multi-allelic VCF representation as "unscalable", but I would probably be using Octopus now if it supported that VCF syntax. Having to re-do my entire clinical analysis pipeline for yet another experimental VCF dialect is a non-starter.

lh3 commented 5 years ago

allowing overlapping records in VCF has been a nightmare for tool developers

Everyone is complaining about this, but no one finds a satisfactory way to eliminate overlapping records. The best attempt is freebayes' CIGAR, but that is close to being useless in the presence of long deletions.

everyone basically picks the semantics that makes the most sense to them at the time and assumes everyone else in the world should think about it the same way.

In my view, the solution to overlapping records is not to eliminate them, but to impose a consistent representation.

why can't '0' actually mean reference? e.g. GT of '1/.' has been used to indicate a "half-call" where ALT 1 is present at a diploid locus and a second allele cannot be determined.

Let's use my example again:

Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1:  XXXXXXXXXGTATAT------XXXXX
H2:  XXXXXXXXXG------GCTGAXXXXX

Since the introduction of *, I have been writing VCF like:

chr1  10  GTATATA G         0|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

My impression is most tools use the same way? Are you suggesting replacing 0|1 to .|1 at pos 10? That is not technically right, either, because the first allele can be "determined". If this is not your suggestion, how do you write VCF for my example?

EDIT: hmm... Rereading @kevinjacobs-progenity's comment, I realize that he only pointed out the problems with both 0|1 and .|1 but didn't give a solution.

dancooke commented 5 years ago

@kevinjacobs-progenity I agree with most of this. I ended up using . as you describe only because the specification doesn't give me another option (if I want a consistent VCF). @lh3 Using 0 in this way is not the approach that I assumed, since the specification does not define it this way. Why then have * at all if we're going to allow reference overlaps with 0 in some places?

Regarding 'multi-allelic'/MNV calls. Scalability is only half the reason; more fundamentally, I think that variant calls should be interpreted as mutation event calls. So If I report chr1 15 TAGCGA TCGCTA, TAGTGA 1|2 then I make the claim that a single mutational event was TAGCGA > TAGTGA. Of course, in most cases, we do not know and cannot confirm which mutational event actually occurred, but we can still make an informed guess and add priors. If you don't care about such details and just want resulting haplotypes then that information is preserved with phase information.

@lh3 There's actually no need to 'look ahead' - you can just keep a record of which reference bases are yet to be determined and fill them in when you see the next record. I agree it's generally a good principle to reduce redundancy, but I think in this case it's worth the relatively small cost in order to avoid inconsistency and potentially reduce parsing errors.

lh3 commented 5 years ago

At pos 10, neither .|1 nor 0|1 is technically correct. I more like 0|1 and believe most existing tools are using it, but I can live with .|1. What is more important is a consensus to reduce the multiple ways to encode the same set of variants.

To close the loop: at one point, I was thinking to use something like the following at pos 10 (note that this is slightly different from the example above)

chr1  10  GTATATA G,GTATA         2|1

The problem is we normally assume the 2bp del on hap1 is placed at pos 11 (this is another problem with multi-allelic VCF), but here we want to put the 2bp del at pos 15, so we would need a CIGAR field to distinguish the two cases. The price to achieve technical correctness is too high. I wouldn't go this way. Base * can also achieve technical correctness, but its price is even higher.

lh3 commented 5 years ago

I went back to bgt and noticed that bgt converts my example to the following VCF:

chr1  10  GTATATA G,<M>  2|1
chr1  15  TAGCGA  T,<M>  1|2
chr1  18  C       CT,<M> 2|1

Here <M> is a symbolic allele, standing for "miscellaneous". It plays the same role as * for deletions, but can also refer to other types of alleles. If we redefine * to be a "miscellaneous allele overlapping with the current record", we will get @jmmut's VCF:

chr1  10  GTATATA G,*   2|1
chr1  15  TAGCGA  T,*   1|2
chr1  18  C       CT,*  2|1

which would be technically correct under the new * definition. With the current spec, though, the above is invalid. I am not sure if anyone is using this syntax in practice.

d-cameron commented 5 years ago

Everyone is complaining about this, but no one finds a satisfactory way to eliminate overlapping records. The best attempt is freebayes' CIGAR, but that is close to being useless in the presence of long deletions.

Coming from the SV world, I don't really appreciate why overlapping records are so problematic for indels. Is it because downstream analysis pipeline break on unphased overlapping records, because left-alignment of phased records results in an self-contradictory haplotype, because caller don't reliably report phasing blocks, because VCF phasing doesn't work in the presence of aneuploidy (this is a huge issue for somatic SVs), because the non-uniqueness of variant and haplotype representation makes cohort analysis and variant comparison too complex, because the 'optimal' representation (e.g. 2 SNVs vs 1 MNV call) is different for different downsteam analyses, or something else?

My naive assumption has been that SNV/indel callers determing haplotypes, and reporting the left-aligned set of variants corresponding to the set of SNVs and indels in the haplotype-reference alignment would give a representation without any of the above issues. Why doesn't this work?

dancooke commented 5 years ago

@d-cameron It's not really indels specifically. There are three issues in my mind:

There is more than one way to explain the same haplotype. For example, it's always possible to explain a haplotype resulting from a SNV with two indels (an insertion and deletion). This 'issue' is what lead to the development of several tools (e.g. RTG Tools vcfeval) to compare variant calls on a haplotype level - effectively ignoring representation differences. However, if we assume variant calls represent mutation events - which I believe should be the case - then you can begin to talk about an optimal representation. In most cases, we will never be able to determine what that optimal is, but our prior will usually point to the 'simplest' solution (Occam's razor and all). Simplest just means the solution with least assumptions, so if I say the mutation event was ACT > GCC then I assume a single mutation event resulted in this MNV, but arguably the simpler explanation is two separate SNV mutations A > G + T > C. On the other hand, if I need to explain the haplotype ATCGAA > TTCGAT then probably the simplest explanation is a single micro-inversion event. The key point here is that we can argue about these differences in representation with some theoretical underpinning, and are in principle verifiable.
Calling multi-allelic loci is not scalable or practical. Even if we ignore point 1 and decide to conflate mutation calls and haplotype calls, there comes a point where it's just not practical to read and interpret the resulting call set. For example, if I call a heterozygous deletion of chr1:100-201, but also call an SNV at chr1:150, do you really want me to report the SNV as a 100nt string? Sure you can add a CIGAR field, but what help is that really? What unnecessary allele lengths are we willing to tolerate in order to have multi-allelic calls? This is obviously subjective, and will result in different representations - without any theoretically justifiable argument. This makes annotating calls more difficult and less consistent.
Overlapping variant calls result in inconsistent genotype calls. So having accepted either of the two points above, we need to face the reality that we'll end up with individual variant records in our VCF output that describe alleles that overlap with one another. However, the first versions of VCF only allowed genotype calls (i.e. FORMAT/GT) to refer to the REF or ALT alleles, or indicate the allele was 'missing' (.). So let's suppose we call a heterozygous deletion overlapping with a SNV, and the reference elsewhere. How do we report this in VCF, like this?

chr1  10  GTATATA G   0|1
chr1  15  T G   1|0

But this is inconsistent. The first record asserts the reference 0 for the first haplotype while the second asserts an SNV. Likewise the first record asserts a deletion for the second haplotype while the second record asserts the reference. Another option is to replace the 0s with .s, but this doesn't accurately describe what we actually called. Later versions of VCF (4.3 I believe), added the * symbol in an attempt to resolve this situation, but the definition was ambiguous and insufficient, leading to the discussion here.

dancooke commented 5 years ago

I see three viable solutions:

Clarify that * is symbolic. Extend the definition to specify overlaps with any allele in either direction. Make clear that * implies the reference for non-overlapping bases. This will result in calls like:

chr1  10  GTATATA G,*       2|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

* indicates that this reference base is specified by an overlapping record. This will result in calls like:

chr1  10  GTATATA G,GTATA**  2|1
chr1  15  TAGCGA  T,**G*GA  1|2
chr1  18  C  CT,*  2|1

It's possible to extend this definition to allow alleles consisting only of * to be replaced with a single *, resulting in identical output as symbolic * for fully overlapped alleles. The difference in parsing is trivial.

An extension to option 2 to include a distinct chapter for downstream overlaps. I realised allowing * to refer to overlaps in both directions means that it's not possible in general to reconstruct haplotypes without position information from the previous records. With distinct characters this wouldn't be required. Resulting output is like:

chr1  10  GTATATA G,GTATA##  2|1
chr1  15  TAGCGA  T,**G#GA  1|2
chr1  18  C  CT,*  2|1

This would enable an additional sanity check when parsing if position information from the previous record is provided.

Note that the alignments for the examples are the same as first given by @lh3 here:

Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1:  XXXXXXXXXGTATAT------XXXXX
H2:  XXXXXXXXXG------GCTGAXXXXX

My preference is option 2 or 3, with the *-only allele replacement. This results in the same output as option 1 most of the time, but is more explicit in the relatively rare case of partial overlaps - which I consider a positive. It's worth noting that the type of information provided by options 2 or 3 is what a VCF parser would need to figure out for option 1 anyway. So I don't think the argument that there are more ways to create invalid VCF's is a good one because by the same logic you can argue there are more ways to incorrectly parse the VCF in the case of option 1.

pd3 commented 5 years ago

@dancooke, can you edit your last comment to show the underlying alignments so that people can comment on your proposal?

In general, this problem is too complex and I don't believe it can be solved within VCF. This is in essence an attempt to represent a graph of arbitrary complexity.

dancooke commented 5 years ago

@pd3 done. I think it depends what you mean by "too complex" and "solved". If you mean that we can have a consistent VCF that accurately describes inferred mutation calls and haplotypes then I do believe the proposals I suggest offer solutions. Unless you can find a counter-example? Presumably the same problems exist for graph-based methods since essentially these just changes the reference in a sample specific manner? Then there's the additional problems of computing and applying population annotations without a common reference.

lh3 commented 5 years ago

There are other solutions.

Redefine * to be the ensemble of alleles overlapping with the current record. Pros: theoretically clean. Cons: may choke some existing parsers. This would be the top choice in the early days of VCF.
```
chr1  10  GTATATA G,*       2|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1
```
Redefine 0 to be reference allele until the next record. Pros: least disruption to existing tools. This has been implicitly used. VCF parsers can all deal with this to different degrees. Cons: not theoretically clean. If we don't make a decision, we are effectively adopting this redefinition. A well-thought VCF parser needs to support this anyway for compatibility.
```
chr1  10  GTATATA G         0|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1
```
Redefine . to be a missing allele or an unspecified allele. Cons: not clean (one symbol, two meanings); not compatible (some parsers may take half genotypes as missing). I don't like it, but I don't strongly object to it.
```
chr1  10  GTATATA G         .|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1
```
Use the CIGAR field. Pros: no change to the spec. Cons: complex (few tools produce such VCF); misinterpreted by parsers not dealing with CIGAR. I don't like it, but I don't strongly object to it.
```
chr1  10  GTATATA G,GTATAT CIGAR=1M6D,6M1D 2|1
chr1  15  TAGCGA  T,*    ..   1|2
chr1  18  C       CT,*   ..   2|1
```

Base * is out of question.

In general, this problem is too complex and I don't believe it can be solved within VCF.

I agree. My attempt here is to clarify and possibly to alleviate the issue without breaking the existing tools.

dancooke commented 5 years ago

@lh3 Your solution 1 is the same as my solution 1 no? 2 is not a good solution as you have to redefine GT. It also results in the weird situation where you call the reference after the next allele but actually this doesn't technically count because by the definition the next allele stops the reference allele. 3 is better than 2 but worse than 1 since it stretches the meaning of .. 4 I strongly object to for reasons given in detail already - this would seriously limit the practicality and expressiveness of VCF.

Base * is out of question.

That's not for you to decide alone. You keep dismissing base * out of hand as too complex but don't elaborate. I don't believe you even considered any of the detailed arguments I made. As I said, the 'complexity' you're presumably referring to is what any parser would need to work out anyway since the complexity is inherent in the problem. Having the VCF generator do this is preferable in my opinion as the information should already be available (having called the haplotypes), and it reduces the likelihood of parsing errors. Since this is undoubtably a complex problem, it deserves an explicit and consistent grammar with built-in sanity checks (via a small amount of redundancy). Only base * provides that.

My attempt here is to clarify and possibly to alleviate the issue without breaking the existing tools.

Pretty much any solution we decide on will break some tools because most tools currently use different syntax. Support for upstream base * was added to RTG Tools vcfeval without much issue (as I understand), so I don't think this is as painful as you make out.

yfarjoun commented 5 years ago

The issue with the reference allele not actually meaning strict equality with the reference has been the defacto standard for ever and is evident in interaction between a deletion and a SNP:

POS REF   ALT FORMAT s1  s2
10  GTATA G   GT     0/1 0/0
12  A     C,* GT     1/2 0/1

The 0/0 GT is technically "wrong" since the haplotype has a SNP in it (as evident in POS 12), so the haplotype is not REF.

I support the restriction of the ALT strings so that * can only appear on its own (as, I think, was the original intention) and also the restriction of GT 0 to mean

"Agrees with REF except for cases where that disagreement should be represented in a different POS."

dancooke commented 5 years ago

@yfarjoun According to that definition you don't even need * at all. Also your example doesn't make sense. For a start, the genotypes aren't phased, in addition, the genotype at POS 12 for sample s2 should be 0/0 not 0/2 since there are no interacting deletions for s2. I think you meant that 0/1 is technically wrong. This is kind of my point of why we need technical correctness. If we can't even get these simple examples right then something is seriously wrong...

yfarjoun commented 5 years ago

thanks for spotting the typo @dancooke.

yfarjoun commented 5 years ago

but no I really mean that 0/0 is technically wrong since strictly speaking it means that s2 has the reference allele (all 5 bases of it) on both haplotypes. and it doesn't.

I think that the phasing element is irrelevant, the interpretation of the alleles is the same.

lh3 commented 5 years ago

@yfarjoun is a core GATK developer. He confirms my speculation that the overwhelming majority of VCFs, knowingly or not, have adopted:

chr1  10  GTATATA G         0|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

All these VCFs are technically wrong according to the current spec. However, if we redefine 0 in the spec, these VCFs will become technically correct and most existing tools will continue to work without modifications. It is a complete and easily the cheapest solution.

@dancooke, thank you for raising the issue, which made us (me at least) clear that the spec has flaws. Meanwhile, you need to stand from a maintainer's point of view. It is close to impossible to flip the VCF ecosystem just for octopus, no matter how terrific it is.

nh13 commented 5 years ago

you need to stand from a maintainer's point of view

I really don't want to inflame this conversation, can the maintainers weigh in? @lbergelson, @pd3 & @jmmut

dancooke commented 5 years ago

@yfarjoun Ok well now you've changed the example it makes more sense. This type of example does demonstrate another reason why changing the meaning of 0 in GT is a bad idea - it results in misleading and inconsistent annotations:

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G   GT AC=1;AN=4      0/1 0/0
12  A     C,* GT AC=2,1;AN=4    1/2 0/1

So the first record suggests that the frequency of the reference allele is 0.75 but the second just 0.25 - but they describe the same haplotype. The results for each of my solutions:

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,*  AC=1,2;AN=4  GT   2|1 0|2
12  A     C,*  AC=2,1;AN=4 GT    1|2 0|1

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT*TA  AC=1,2;AN=4 GT  2|1 0|2
12  A     C,* AC=2,1;AN=4 GT       1|2 0|1

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA AC=1,2;AN=4  GT  2|1 0|2
12  A     C,* AC=2,1;AN=4  GT      1|2 0|1

In each case, all is well in the world. This type error will almost surely mislead analysis and waste time. I understand there's an initial cost associated with changing the current "ecosystem", but surly, now that the problem is understood and a solution proposed, is the time to fix it and potentially save many hours of otherwise future wasted efforts?

dancooke commented 5 years ago

@lh3 I think it's a little disingenuous to suggest I'm arguing this simply for my own benefit. I didn't even raise this issue. I believe the changes I'm suggesting will serve the community - which is not isolated to the GATK ecosystem - best in the long term.

lh3 commented 5 years ago

0.75 is indeed the reference frequency up to the next overlapping record. If we redefine 0, everything will be consistent. Also, we most often care about the frequency of the 1 allele, which is not affected either way.

Meanwhile, this observation rejects your ./1 genotype as AN will be wrong. Also, such multi-sample examples provide an argument against base *. Think what you will do when there is a 100kb deletion in a large population.

dancooke commented 5 years ago

@lh3 Ok but let's be honest, people do not process VCF that way - the vast majority of users won't be aware of such quirks of the specification; they will take the information provided in individual record annotations at face value, and won't consider overlaps at all (which I believe is the point @d-cameron was making). That's why it's so important to have it made explicit.

POS REF   ALT INFO FORMAT s1
10  GTATA G   GT    0|1
12  A     C,* GT    1|2

In the example above, is the reference on the first haplotype for 13-14 called or not? According to the 'hack' 0 definition it technically isn't since the record at 12 'blocks' the previous record. So if I want to state those bases are called reference, I have to resort to adding additional reference call records covering those bases. Meaning that the base * output:

POS REF   ALT INFO FORMAT s1
10  GTATA G,GT*TA GT  2|1
12  A     C,*       1|2

Is semantically equivalent to the hack 0 output:

POS REF   ALT INFO FORMAT s1
10  GTATA G   GT    0|1
12  A     C,* GT    1|2
13 TA <*>,* GT 0|2

rather than the first example shown. This is simply unacceptable - and really does introduce completely unnecessary redundancy. This will also be a headache for code generating VCF. Then there's also the problem of whether the next record is filtered, does that still 'block' the previous record or not?

I never claimed ./1 was the correct solution. If there's a 100kb deletion in a large population then the output of base * will be almost identical to the well-defined symbolic * definition, only the deletion itself - and any other partial overlapping alleles - will contain additional ALT alleles.

jmmut commented 5 years ago

IMO the question is within clarifying 0 as "reference until next record", or changing * from "overlapping deletion" to "ovelapping allele", which correspond respectively to option 2 (of @lh3 here) and option 1 (of @dancooke here and here and of @lh3 here)

As much as I dislike it, I think the best pragmatic way to address this is to clarify 0 as "reference until next record", at the very least for 4.3 and previous versions, given that it's the de facto standard. I think the worst we could do is to allow different VCFs in the same version to mean different things.

Maybe it's not wise to change this even for future versions of VCF. Usually "several standards" is worse than "a single mediocre standard, with its flaws clearly stated" because it is more confusing to users, requires more knowledge about format history and requires more support from tools.

However, for the next VCF version, I would probably favour changing * from "overlapping deletion" to "ovelapping allele", to be able to restrict the meaning of 0 to "true reference", basically replacing just the usage of those 0-but-not-really to *. I think this would allow correct representations to most of @dancooke concerns, while being familiar to users and tools. This option still has the problem of 0 meaning different things across versions, though.

I totally agree that a for new formats, the base-* is an interesting take. It's less ambiguous and uses record interdependence in a well-defined way. It looks actually still correct when filtering out some variants, as some analysis do. However, on top of needing parsers support, I still see it as more complex. Another corner case example is that there are other issues open on how to merge VCFs, and adding more samples to a VCF with new variants that overlap deletions would require updating those deletions too, adding complexity due to the interdependence.

So moving to concrete actions, I humbly think what we should do for the current version at least is:

Clarify that GT=0 means REF allele only until next record.
Clarify ambiguity on the "Strings made up of the bases A,C,G,T,N," part to ban `` mixed with bases.

I'm sorry there's no solution here that pleases everyone.

dancooke commented 5 years ago

@jmmut If we're really going to go with the re-define 0 option then the re-definition needs to be more like:

GT=0 means REF allele unless specified otherwise by downstream records.

Otherwise you have the problem I described above. Essentially then 0 adopts the meaning of a symbolic * but for downstream records (in addition to meaning REF).

I think it's also worth pointing out that - as far as I'm aware - the only two open source variant callers that use * at all are GATK4 and Octopus. Symbolic * support was added to GATK4 in version 4.0.9.0 (released on 20 September 2018), while Octopus has supported base * since its release in October 2016. I therefore don't think it's accurate to say that using * and 0 in the GATK4 (v4.0.9.0+) way is "standard". Probably the most widely used interpretation of 0 is actually REF unless specified otherwise by any other record, since this is what every caller other than GATK4 (v4.0.9.0+) and Octopus use (to the best of my knowledge). This would technically need to be the re-definition of 0 for pre-v4.2 VCF (without *), and could be used for v4.2-3 to achieve consistency across versions (although then * in v4.2-3 becomes redundant). Furthermore, notable callers that don't use * still use VCF v4.2 (e.g. DeepVariant), so the definition of 0 would need to be the same as v4.1 unless their calls are to break.

I think a fair compromise would be to re-define 0 for VCF up to and including v4.3, but the next version should use either the well defined symbolic * or base *. I think it would be reasonable to request that this new version of VCF is made available promptly (ideally alongside 4.3). This way, nothing breaks for the majority of tools that don't use *, GATK can maintain 4.3 and update to the newer version at their own pace if desired, while Octopus and other tools wishing to adopt * can take advantage of the newer version without having to conform to GATK's choices.

pd3 commented 5 years ago

I can confirm that the second solution from @lh3's list is in my experience the most common interpretation of complex events that span multiple records.

The REF and ALT columns can be viewed as operations that need to be applied in order to obtain the actual haplotype sequence. In a set of overlapping rows the 0 allele is considered as "not changing the REF allele at this record", but should not be interpreted rigidly as the reference allele if other conflicting records follow (or precede).

Note that this interpretation has been in use for various purposes for a long time, e.g. when breaking multiallelic sites into biallelic sites in order to make them digestable by imputation programs.

I agree that the specification needs a clarification to prevent future confusion. I support @lh3's proposal, only would remove the requirement of ordered records as I think stating "overlapping" is more general and there have been strong concerns (now and in past) about the order requirement.

dancooke commented 5 years ago

@pd3 Then * should be removed from the specification as it would be made redundant:

chr1  10  GTATATA G         0|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

chr1  10  GTATATA G         0|1
chr1  15  TAGCGA  T         1|0
chr1  18  C       CT,*      2|1

chr1  10  GTATATA G         0|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT        0|1

chr1  10  GTATATA G         0|1
chr1  15  TAGCGA  T         1|0
chr1  18  C       CT        0|1

Would all describe the same set of haplotypes.

pd3 commented 5 years ago

No, I don't think * should be removed from the specification. The proposal is: 1) no new * base 2) the * allele remains to indicate overlapping alleles from different records 3) the interpretation of 0 suggested as above ("no REF change at this record"). In this meaning it is used only as a default fallback to resolve possible conflicts and ambiguities

dancooke commented 5 years ago

@pd3 Are you suggesting that under your proposed re-definition of 0 and * that the these examples are not semantically equivalent? If so I am truly lost.

lh3 commented 5 years ago

the only two open source variant callers that use * at all are GATK4 and Octopus

Htsbox has been using * since 2015 or 2016. This actually doesn't matter. Even before * was introduced, 0 had never been the REF haplotype in fact.

I would probably favour changing * from "overlapping deletion" to "ovelapping allele", to be able to restrict the meaning of 0 to "true reference", basically replacing just the usage of those 0-but-not-really to *.

This changes the semantics of VCF. A record in today's VCF effectively describes the interval maximally between this and the next record. With your change, a record will describe the interval of the current record. This is a subtle but important difference. Your proposal sounds better in theory, but will have problems in practice. Suppose we have a 100kb deletion in a population. If 0 only means the reference haplotype, we will be unable to determine the number of 0s with short reads because we won't be able to phase through this 100kb region in diploid samples without this deletion. Today's VCF doesn't have this issue because it doesn't require haplotype information or phasing.

Earlier I said "[redefining *] would be the top choice in the early days of VCF". Now I am in favor of the current practice, which is the one you and @pd3 have described.

jmmut commented 5 years ago

@pd3

the * allele remains to indicate overlapping alleles from different records

the spec says "overlapping deletion", did you mean change to "overlapping allele" or remain as overlapping deletion?

pd3 commented 5 years ago

@jmmut Yes, I meant "overlapping allele". Can't think of any harm it might cause, happy to be proven wrong though.

jmmut commented 5 years ago

@pd3 Then I'm lost too because this is the very issue that we were trying to avoid by staying with the "0 means not changing the REF allele at this record". This sounds to me doing both options I described, and having the disadvantages of both. (Do the rest think it's ok to change the spec to both "0 means not changing the REF allele at this record" and * to mean overlapping allele? this would mean that you can not state a "true reference" even if you use *)

@lh3 I'm sorry but I don't understand what problem this scenario represents. Could you elaborate, please? Would this still apply if we did both changes?

Suppose we have a 100kb deletion in a population. If 0 only means the reference haplotype, we will be unable to determine the number of 0s with short reads because we won't be able to phase through this 100kb region in diploid samples without this deletion.

Also, with the "0 means not changing the REF allele at this record" interpretation, I don't understand (just as @dancooke) if these are semantically equivalent to @pd3 or not, and if yes, why * was introduced in the first place. If they are not equivalent, then I don't know which one is the correct one nor why.

samtools / hts-specs

VCF spec allows ALTs that are mixes of * and bases, but doesn't define how to interpret them #437