samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

VCF spec allows ALTs that are mixes of * and bases, but doesn't define how to interpret them #437

Open tfenne opened 5 years ago

tfenne commented 5 years ago

The VCF spec contains this text on ALTs:

Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on breakends. The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion

I think probably the intention here was to allow either * or [ACGTN]+, but not a mixture of the two. But as written it theoretically allows alleles like *A*C*G*T*, but has nothing to say on how such alleles should be interpreted.

I've recently come across VCFs generated by Octopus that contain alleles that start with a * and end with base sequence, e.g. *AC. This is technically valid VCF, but I suspect most tooling won't support it (GATK tools fail on it). In their case they seem to be using it to indicate that the anchor base in a deletion record is covered by an upstream deletion, but not the whole allele.
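To make the ambiguity concrete, here is a minimal sketch of how a parser might sort ALT alleles into the cases discussed above. The regular expressions and the `classify_alt` helper are assumptions made for illustration; they are not taken from the VCF spec or from any existing tool.

```python
import re

# Both patterns are this sketch's assumptions, not wording from the VCF spec.
BASES_ONLY = re.compile(r"[ACGTN]+", re.IGNORECASE)
BASES_OR_STAR = re.compile(r"[ACGTN*]+", re.IGNORECASE)


def classify_alt(allele: str) -> str:
    """Return 'star', 'bases', 'mixed', or 'other' for a single ALT allele."""
    if allele == "*":
        return "star"
    if BASES_ONLY.fullmatch(allele):
        return "bases"
    if BASES_OR_STAR.fullmatch(allele):
        # e.g. '*AC': permitted by the quoted wording, but with no defined meaning
        return "mixed"
    # symbolic <ID> alleles, breakends, or malformed strings
    return "other"


for alt in ["*", "AC", "*AC", "*A*C*G*T*", "<DEL>"]:
    print(alt, classify_alt(alt))
```

A strict reading of "either * or [ACGTN]+" would reject anything this sketch labels `mixed`.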

I think the spec should either be updated to clarify this kind of usage or if the horse is already out of the gate, prohibit it.

dancooke commented 5 years ago

@jmmut I believe that @lh3 is describing the following situation. Suppose we have (I'm going to use the most explicit base * form for clarity):

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GTAT#  GT   2|1 2|3
12  A     C,*  GT             1|2 1|2
14  A     T,*  GT             2|2 2|1

Now suppose that we cannot phase the variants at 12 and 14 for sample s2. How do we represent this? We can no longer assert the genotype 2|3 at position 10 for s2 as this implicitly determines the phase of the entire region. I believe that the solution is:

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GT#T#  GT:PS   2|1:10 3|3:10
12  A     C,*  GT:PS             1|2:10 1|0:12
14  A     T,*  GT:PS             2|2:10 0|1:14

Which accurately describes the situation. The solution for symbolic * would be

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,*  GT:PS   2|1:10 2|2:10
12  A     C,*  GT:PS   1|2:10 1|0:12
14  A     T,*  GT:PS   2|2:10 0|1:14

jmmut commented 5 years ago

Ok, that makes sense, thanks. But if you don't know the phase, why not use the unphased separator /?

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,*  GT   2|1 2/2
12  A     C,*  GT   1|2 1/0
14  A     T,*  GT   2|2 0/1

As I understand it, PS is for grouping variants under the same phase when you don't know the absolute phase of that group, right? But s1 can be absolutely phased and there are no phase groups for s2 (in these variants). I'm not aware of any restriction on mixing / and | in the same row, or in the same sample.
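As a rough illustration of mixing separators, the sketch below splits a GT string into allele tokens and separators, so phased and unphased samples in the same row can be handled uniformly. The helper names `parse_gt` and `is_fully_phased` are hypothetical.

```python
import re


def parse_gt(gt: str):
    """Split a GT string into allele tokens and the separators between them."""
    alleles = re.split(r"[/|]", gt)
    separators = re.findall(r"[/|]", gt)
    return alleles, separators


def is_fully_phased(gt: str) -> bool:
    """True when every separator is '|'. A haploid call has no separators
    and is treated as phased here, which is an assumption of this sketch."""
    _, separators = parse_gt(gt)
    return all(sep == "|" for sep in separators)


# Position 10 from the example above: s1 is phased, s2 is not.
for sample, gt in [("s1", "2|1"), ("s2", "2/2")]:
    print(sample, parse_gt(gt), is_fully_phased(gt))
```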

Also, I still don't see why this is an argument against * meaning "overlapping allele" to allow 0 to mean "true reference".

dancooke commented 5 years ago

@jmmut You could do that for this example. More generally, I like to use PS for every record because once we get past these toy examples we'd almost certainly want to give a PS to the records in s1, and then we must specify PS for s2 also. It's really just a matter of personal preference.

I don't think this is an argument against symbolic * meaning that; I believe @lh3 was mistaken, or I interpreted his argument incorrectly.

dancooke commented 5 years ago

Here's a summary, how I see it

Then, VCF v4.2-3 could either

  1. Keep this definition, rendering * completely redundant, and introducing a type of semantic equivalence that will cause great confusion if not addressed going forward.
  2. Modify the definition of * to allow upstream overlaps with any allele and re-define GT=0 to mean REF allele until the next record (as suggested by @lh3) or re-define GT=0 to mean REF allele unless specified by a downstream record (as suggested by me). The former results in redundancy if we want to assert the reference past the next record. Either way should satisfy GATK, but break Octopus, and any tools using VCF v4.2+ but not using * (e.g. DeepVariant). Technically any VCFs (v4.2-3) not using * would immediately become invalid. It also results in different definitions of GT between VCF versions.

I would advise going with option 1, but having a new version of VCF ready that re-defines GT=0 to mean 'true reference' and gives * a complete definition. This is the least painful option; technically no existing VCFs should break. Parsers dealing with VCFs (up to the new version) containing * essentially just need to read * as 0 (since they would become semantically equivalent) - which should be easy if they're supporting pre-v4.2 VCF anyway. Moreover, * has yet to see widespread uptake, and this way, users and tools that do wish to adopt a meaningful * will need to explicitly opt-in (by using the new VCF version).
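The "read * as 0" suggestion could look roughly like the sketch below. It assumes ALT has already been split into a list and that the ALT list itself is left unchanged, so other allele indices need no renumbering. The function name `star_to_ref` is hypothetical.

```python
import re


def star_to_ref(alts, gt):
    """Rewrite GT indices that point at a '*' ALT to 0, keeping separators as-is."""
    # ALT indices are 1-based in GT; 0 is REF.
    star_indices = {str(i) for i, alt in enumerate(alts, start=1) if alt == "*"}
    # The capturing group keeps the '/' and '|' separators in the token list.
    tokens = re.split(r"([/|])", gt)
    return "".join("0" if tok in star_indices else tok for tok in tokens)


print(star_to_ref(["C", "*"], "1|2"))
print(star_to_ref(["G", "*"], "2|2"))
```

Missing alleles (`.`) and separators pass through untouched, since neither can match a numeric `*` index.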

lh3 commented 5 years ago

I was in a full-day retreat.

There are three definitions for the 0 allele number. 0 could mean:

  1. The reference allele in full. We thought we were using this definition, but in fact 0 has never meant this, just as @yfarjoun said. I will show later this definition is actually a bad idea.

  2. The reference allele up to the next record. This is the interpretation in the current spec.

  3. There are no ALT alleles from the current record. This is the interpretation before we introduced the * allele.

I don't understand what problem this scenario represents.

@jmmut let's use your example:

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,*  GT   2|1 2/2
12  A     C,*  GT   1|2 1/0
14  A     T,*  GT   2|2 0/1

s2's genotype at pos 10 is in fact not determined. It could be 0/2 or 2/2 depending on the phase at 12 and 14. The root cause is that, under definition 1, 0 needs to be phased throughout, but we often don't have the phasing information. With the current spec (i.e. def 2):

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G    GT   0|1 0/0
12  A     C,*  GT   1|2 1/0
14  A     T,*  GT   2|2 0/1

It is ok. Definition 3 also works. Between 2 and 3, def 2 is more convenient as we can know the REF allele count up to the next record. More importantly, as the current spec has already adopted 2, I see little reason to revoke our earlier decision.
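The "REF allele count up to the next record" point can be sketched as a per-record tally of what each haplotype carries. This is only an illustration: `allele_counts` is a hypothetical helper, and whether the `ref` count means anything beyond the current record depends on which definition of 0 is in force.

```python
import re
from collections import Counter


def allele_counts(alts, gt):
    """Tally haplotypes as 'ref' (index 0), 'star' ('*' ALT), 'alt', or 'missing'."""
    counts = Counter()
    for tok in re.split(r"[/|]", gt):
        if tok == ".":
            counts["missing"] += 1
        elif tok == "0":
            counts["ref"] += 1
        elif alts[int(tok) - 1] == "*":
            counts["star"] += 1
        else:
            counts["alt"] += 1
    return counts


# s2 in the def 2 encoding above.
records = [(10, ["G"], "0/0"), (12, ["C", "*"], "1/0"), (14, ["T", "*"], "0/1")]
for pos, alts, gt in records:
    print(pos, dict(allele_counts(alts, gt)))
```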

Going forward, we should just follow our current practice: clarify * to be a single allele and redefine 0 to be the reference allele up to the next record. It is actually optimal. (EDIT: I wouldn't say this is optimal, but given that we have decided on 2, reverting back to 3 is worse.)

dancooke commented 5 years ago

@lh3 The 'problem' you describe is exactly as I thought, but I explained quite clearly why this is not actually a problem. You are mistaken I'm afraid.

Going forward, we should just follow our current practice: clarify * to be a single allele and redefine 0 to be the reference allele up to the next record. It is actually optimal. (EDIT: I wouldn't say this is optimal, but given that we have decided on 2, reverting back to 3 is worse.)

This is simply unacceptable since it completely ignores the elephant in the room that the working definition of GT=0 (i.e. your option 3) has remained the same, despite * being added to VCF v4.2. This was essentially confirmed by @pd3, but easily demonstrable by the fact that there are notable tools using VCF v4.2-3 without using * (e.g. DeepVariant). By going with option 2, you're condemning an unknowable number of VCFs and tools that use/accept VCF v4.2-3 without using * to become invalid. This would be ridiculously unfair since the definition of GT never changed in the specification, and * was poorly defined to begin with. In reality, your option 2 is only the "current practice" of GATK4 v4.0.9.0+. The only sensible way to address this is to give all existing VCF versions the same definition of GT, and option 3 is the only possible option. Then no existing VCFs are technically invalidated. The only changes required are for the few VCF parsers accepting * - since * then becomes semantically equivalent to GT=0 (a straightforward change).

Going forward, we can choose to define GT and * in another way for the next VCF version, offering a clean slate so to speak:

  1. Deprecate * and just keep the previous version of GT.
  2. Define GT=0 to actually mean REF + one of the technically sound alternatives of * that I originally proposed.
  3. Your option 2 + some re-definition of *.

Option 1 would be a shame since the reason why * was originally added in the first place is clear. Option 3 is almost certainly the worst option since it has pretty much half the benefits of any choice of option 2 but most of the downsides of option 1 in addition to its own. Problems include:

Just to stress the first point. Let's suppose I'm calling germline/somatic variants in a tumour sample and get (using option 3)

POS REF   ALT INFO FORMAT s1
10  GTATA G SOMATIC  GT    0|0|1
12  A     C,* GT    1|1|2

Now I'm only really interested in somatic variants, so I filter my VCF for these (this is a very common thing to do):

POS REF   ALT INFO FORMAT s1
10  GTATA G SOMATIC  GT    0|0|1

So I'm now left believing that my somatic variant occurred in the context of a tandem repeat (the reference), but this is not the case. This is exactly the type of situation that will mislead analysis.
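The loss of context described here can be detected mechanically. The sketch below filters records on an INFO flag and reports, for each retained record, which dropped records overlapped its REF span. The record layout (plain dicts) and the function names are assumptions for illustration only.

```python
def ref_span(rec):
    """Inclusive reference interval covered by a record's REF allele."""
    return rec["pos"], rec["pos"] + len(rec["ref"]) - 1


def filter_with_overlap_report(records, flag):
    """Keep records carrying `flag`; map each kept POS to the POS values of
    dropped records whose REF span overlaps it."""
    kept = [r for r in records if flag in r["info"]]
    dropped = [r for r in records if flag not in r["info"]]
    report = {}
    for k in kept:
        k_start, k_end = ref_span(k)
        report[k["pos"]] = [
            d["pos"]
            for d in dropped
            if ref_span(d)[0] <= k_end and ref_span(d)[1] >= k_start
        ]
    return kept, report


records = [
    {"pos": 10, "ref": "GTATA", "alts": ["G"], "info": {"SOMATIC"}, "gt": "0|0|1"},
    {"pos": 12, "ref": "A", "alts": ["C", "*"], "info": set(), "gt": "1|1|2"},
]
kept, report = filter_with_overlap_report(records, "SOMATIC")
print(report)
```

A non-empty list for a kept record signals that its 0 calls may no longer describe the haplotype once the overlapping record is gone.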

Any of the choices of option 2 have none of these issues. At this point I would settle for the first 'symbolic' * choice just to avoid the catastrophe of option 3, but I still firmly believe either of the base * choices are better.

lh3 commented 5 years ago

I explained quite clearly why this is not actually a problem.

No, you didn't. Slightly modify your example:

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GTAT#,GTCTT  GT   2|1 0|3
12  A     C,*  GT             1|2 0|1
14  A     T,*  GT             2|2 0|1

If phasing is unavailable, do you want to turn 0|3 to 2/2 for s2 at pos 10? If it is, * needs to represent the true REF allele in this case; if * may represent the true REF, any 0/1 genotype may be replaced by 2/1. Or you can say "* may represent the true REF if this can't be determined", but then you need to define "determined".

the working definition of GT=0 (i.e. your option 3) has remained the same, despite * being added to VCF v4.2.

With *, the definition of 0 has been implicitly changed to def 2 accordingly. It doesn't matter what you "think" 0 is. I am good with both def 2 and def 3. I just think it is too late to revert back. We can acknowledge the two meanings of 0 in the spec. We have already been doing that in the SAM spec (e.g. the TLEN field).

lbergelson commented 5 years ago

I've been following along with this, but I'm on vacation with my family and each time I try to reply I get interrupted partway through; by the time I return, the argument has moved on.

I'm unconvinced that there is anything catastrophically wrong with the current use of the * allele.

We do definitely need to clarify what 0 means in the context of a downstream overlapping record.
"The reference allele up to the next record" seems to work for a large existing mass of VCFs and also matches what many people have implicitly decided on.


It does seem like it's currently difficult to specify complex exact matches to reference using the existing * in some of the cases you mentioned. In general VCF hasn't done a very clear job of distinguishing when we believe something that isn't represented is reference vs when it is not defined.

How you interpret unspecified sites in the VCF depends on your assumptions about the provenance of the VCF and isn't explicitly encoded.

For instance, in the following, what is the genotype in any spot between 10 and 20?

10  A    C  GT     0|1
20  G    T  GT     0|1

It's unclear if it's a confident 0|0 or if it's unknown, and which it is depends on the provenance of the vcf. If we wanted to explicitly specify it we would need to add a homref block.

10  A     C           GT     0|1
11  G    <*>  END=19  GT     0|0
20  G     T           GT     0|1
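The difference between the two encodings above can be sketched as a position lookup: with a reference block the answer for position 15 is explicit, without it the site is simply not represented. The tuple layout `(pos, ref, alt, end)` and the function name are assumptions of this sketch.

```python
def describe_position(records, query):
    """Classify a position as covered by a called record, a reference block,
    or not represented at all. `end` is the END value, or None if absent."""
    for pos, ref, alt, end in records:
        stop = end if end is not None else pos + len(ref) - 1
        if pos <= query <= stop:
            return "reference block" if alt == "<*>" else "called record"
    return "not represented"


without_block = [(10, "A", "C", None), (20, "G", "T", None)]
with_block = [(10, "A", "C", None), (11, "G", "<*>", 19), (20, "G", "T", None)]

print(describe_position(without_block, 15))
print(describe_position(with_block, 15))
```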

Maybe we could clarify some of these examples in a similar way by adding a tag which can be used to specify exactly how long a match to the reference is called. Something like the proposed RBS. Here's an example using Haplotype-Reference-Block-Size (HRBS): a per haplotype integer value that specifies the number of reference bases included in a 0 call for that haplotype.

POS REF   ALT INFO FORMAT s1
10  GTATA G   GT:HRBS 0|1:2,.
12  A     C,* GT    1|2:
13  T     <*>,* GT:HRBS 0|2:2,.

or (somewhat less pleasant I think.)

It seems like you want to be able to explicitly encode this information, but I think doing so in the alleles is not the best option.
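For the HRBS idea above, a reader could derive the reference interval asserted by each 0 call as follows. HRBS is only a proposal in this thread, not an existing VCF tag, and the parsing rules here (comma-separated per-haplotype values, `.` for not applicable) are assumptions read off the example.

```python
import re


def hrbs_spans(pos, fmt, sample):
    """For each haplotype, return the inclusive (start, end) reference interval
    asserted by a 0 call with an HRBS value, or None otherwise."""
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    alleles = re.split(r"[/|]", fields["GT"])
    if "HRBS" in fields:
        sizes = fields["HRBS"].split(",")
    else:
        sizes = ["."] * len(alleles)
    spans = []
    for allele, size in zip(alleles, sizes):
        if allele == "0" and size != ".":
            spans.append((pos, pos + int(size) - 1))
        else:
            spans.append(None)
    return spans


print(hrbs_spans(10, "GT:HRBS", "0|1:2,."))
print(hrbs_spans(13, "GT:HRBS", "0|2:2,."))
```

In the example, the 0 call at position 10 then covers positions 10 and 11 only, stopping before the record at 12.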


@dancooke Your proposed solution using * and # bases in alleles is problematic for processing large cohorts.

Based on this example:

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GTAT#  GT   2|1 2|3
12  A     C,*  GT             1|2 1|2
14  A     T,*  GT             2|2 2|1

For a large deletion in 1 sample, you would need to introduce specific alleles for every other sample which act like a mask for the specific pattern of overlapping alleles in each sample. This prevents finalizing this site until the entire length of the deletion has been processed. In a million-sample VCF, holding all the relevant information in memory becomes tricky with an approach like yours. It would also require you to add up to as many new unique long alleles as there are samples, which isn't ideal considering that cohort VCF size currently scales superlinearly with the number of alleles. Having each sample have unique alleles would also significantly decrease compressibility of GT blocks. Neither of these is necessarily a blocker, but it is an important consideration when adding things to VCF.

The existing * allele instead requires you only to hold the alleles in memory which are still potential overlappers. For a 100kb symbolic deletion it's not clear what you would do with your scheme, just fall back to the existing * allele behavior? Introduce a million new symbolic alleles and record the exact bases somewhere?
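The memory argument can be sketched as a single streaming pass that retains only earlier records whose REF span still reaches the current position. This is an illustration under simplifying assumptions (one contig, records sorted by POS, spans taken from REF length only), not a description of how GATK or any other tool does it.

```python
def sweep(records):
    """Yield (pos, active) for each record, where `active` lists the inclusive
    (start, end) spans of earlier records that still overlap `pos`."""
    active = []
    for pos, ref in records:
        # Drop spans that ended before this position.
        active = [(s, e) for s, e in active if e >= pos]
        yield pos, list(active)
        end = pos + len(ref) - 1
        if end > pos:
            # Only multi-base REF alleles can overlap later records.
            active.append((pos, end))


records = [(10, "GTATA"), (12, "A"), (14, "A"), (16, "C")]
for pos, active in sweep(records):
    print(pos, active)
```

State is bounded by the number of simultaneously open spans rather than by the number of samples.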


As a side note, I would be in favor of redefining * to be equivalent to <M> since there's no reason that I can see that it doesn't work for things like a symbolic inversion that overlaps the site.


Filtering a VCF with the * allele IS hard. There's no question about that. The */# notation would make it a bit easier by identifying which downstream/upstream sites are relevant when filtering, but I don't think that it significantly reduces the processing difficulty in large VCFs. With the * allele you can keep track of relevant alleles and filter in the same way. Either way, it's difficult to decide what the meaning of filtering an allele is when it's overlapping another one, and it is probably going to be application specific. For the example you gave, filtering sites while retaining the information is already supported in VCF with the use of site- and genotype-level filter fields. I don't see why you can't use those in the somatic/germline example you showed.

dancooke commented 5 years ago

@lh3 Your modified example is ill formed because you cannot have the ALT allele GTCTT at position 10.


@lbergelson I don't believe that you can confidently claim that the mass of VCFs are assuming this definition of GT=0; the truth is that we simply don't know whether or not this is the case. To simply enforce this retrospectively because the GATK team have decided upon this interpretation would be deeply unfair to the rest of the community that have formed different interpretations.


Regarding reference calls, while I agree this is a tricky topic, I do believe that the best way forward is in terms of explicitly called alleles. If I call

POS REF   ALT     FORMAT s1
10  GTATA G,GT#TA GT     2|1
12  A     C,*     GT     1|2

then I claim to have considered the two called haplotypes within whatever genotype model I'm using. In particular, I claim to have compared the reference base against a deleted base at each position where the reference has been called. From a haplotype perspective, I could just as well have called

POS REF   ALT FORMAT s1
10  GT  G    GT  0|1
11  TA  T,*A GT  2|1
12  A   C,*  GT  1|2
12  AT  A,*T GT  2|1
13  TA  T,*A GT  2|1

since the haplotype calls are the same (I would however argue that the overall interpretation is slightly different). I would then presumably report measures of uncertainty from my model (e.g. QUAL, GQ, GP, etc) for each record, which is by far the best way to report uncertainty in the reference. Moreover, since these two call sets are semantically equivalent from a haplotype perspective, I should be able to generate the exact same statistics for each reference base in the first example.

By adopting the GATK definition for GT=0, you're unwittingly breaking the semantic equivalence between these two call sets, since the former example could only be stated as

POS REF   ALT INFO FORMAT s1
10  GTATA G   GT    0|1
12  A     C,* GT    1|2
13  TA    <*>,* GT  0|2

With this set of calls you make an entirely different claim about what the model has considered at positions 13 and 14, and cannot report the same measures of uncertainty as for the explicitly called bases.


You raise good points regarding the practicality of base *. I'll need some time to consider these in more detail. Right now I'm not overly concerned whether base * or symbolic * with explicit REF is adopted - I just want to see GATK-style GT=0 taken off the table.


I don't understand how you would filter my germline/somatic example while retaining the information from the germline. Under the explicit REF with symbolic * my example becomes

POS REF   ALT INFO FORMAT s1
10  GTATA G,* SOMATIC  GT    2|2|1
12  A     C,* GT    1|1|2

so when I filter for SOMATIC (e.g. with bcftools view -i SOMATIC=1) I get

POS REF   ALT INFO FORMAT s1
10  GTATA G,* SOMATIC  GT    2|2|1

which is considerably better than having 0|0|1 in my final somatic call-set. Are you saying that you can achieve this with per-record filtering from the GATK-style calls?

lh3 commented 5 years ago

you cannot have the ALT allele GTCTT at position 10.

Why? Either way, that allele doesn’t matter. What matters is how you encode the true reference allele without phasing.

dancooke commented 5 years ago

@lh3 Because then you're asserting the values of positions at 12 and 14 but re-defining them later.

What matters is how you encode the true reference allele without phasing.

This is an oxymoron. You can't state true reference for this allele if you don't know the phase. It would be like me asking how you can assert the reference in

POS REF   ALT INFO FORMAT s1
10  AA  CC  GT    0/1

if I don't know the phase between the two bases. The correct way to encode the situation is

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GT#T#  GT:PS   2|1:10 3|3:10
12  A     C,*  GT:PS             1|2:10 1|0:12
14  A     T,*  GT:PS             2|2:10 0|1:14

since the allele GT#T# does not require phase information for the variants at positions 12 and 14.

lh3 commented 5 years ago

I still don’t understand why this is wrong (oh, a typo: it should be 0|4, but my argument is still there: you think * can represent reference allele; then all 0/1 can be written as 2/1)

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GTAT#,GTCTT  GT   2|1 0|4
12  A     C,*  GT             1|2 0|1
14  A     T,*  GT             2|2 0|1

Are you thinking about the 4-gamete test?

dancooke commented 5 years ago

Ah sorry, I missed that you'd modified the example. The correct way to encode this would be

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GT#T#  GT   2|1 0|3
12  A     C,*  GT             1|2 2|1
14  A     T,*  GT             2|2 2|1

If the region is un-phasable for s2 then it becomes

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GT#T#  GT   2|1 3/3
12  A     C,*  GT             1|2 0/1
14  A     T,*  GT             2|2 0/1

lh3 commented 5 years ago

Then GT#T# and the ref allele may be the same thing. It is indeterministic. The problem is more obvious when you use symbolic *: 0/1 in many of our examples would be equivalent to 2/1 if * can represent ref. Everything is good with def 2 or 3, but with def 1, you need to clarify more things.

dancooke commented 5 years ago

@lh3 It is not indeterministic at all - it is precisely determined when the phase is known. # just means base specified by downstream record. In the first case, since we do know the phasing, the #s are uniquely determined. In the second case, when the phasing is not determined, the #s are not uniquely determined either - but the bases they refer to are determined. This is a completely accurate picture of both situations; given the set of REF and explicit ALT alleles these are the only ways to encode the situation, with and without phase information. What it seems like you want is for the phase to be undetermined but the reference allele to be determined, but this would clearly be a contradiction.
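The reading of # as "base specified by downstream record" can be sketched as a substitution step. The sketch assumes each character of the allele lines up one-to-one with a reference position starting at POS, and that the caller supplies the bases already resolved for the same haplotype. `resolve_hashes` is a hypothetical helper, not part of any tool.

```python
def resolve_hashes(pos, allele, downstream):
    """Replace each '#' with the base called downstream at that position.

    `downstream` maps position -> base on the same haplotype. Returns None
    when a '#' cannot be resolved, e.g. because the phase is unknown."""
    out = []
    for offset, base in enumerate(allele):
        if base != "#":
            out.append(base)
            continue
        called = downstream.get(pos + offset)
        if called is None:
            return None
        out.append(called)
    return "".join(out)


# s2 in the first (phased) example: GT#TA with C at 12, GTAT# with T at 14.
print(resolve_hashes(10, "GT#TA", {12: "C"}))
print(resolve_hashes(10, "GTAT#", {14: "T"}))
# Unphased case: the positions are known but the bases are not.
print(resolve_hashes(10, "GT#T#", {}))
```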

lh3 commented 5 years ago

It depends on how you define indeterministic. A definition of 0 depending on phasing is indeterministic IMO. Think of it this way. With def 2, the number of 0 alleles exactly corresponds to the number of ref alleles up to the next record. With def 1, the number of 0 alleles doesn’t mean much. Some other ref alleles may be hiding behind *. Def 3 doesn’t say anything about the ref allele, so it doesn’t matter.

dancooke commented 5 years ago

It is not 0 that may be undetermined but #, and neither definition changes; 0 only ever has one definition and meaning for a given record - it will never change no matter what happens around it. However, under your def 2, 0's meaning can change since it is determined by what may, or may not, follow it; the meaning changes if you add or remove (or filter?) subsequent records. I believe this is fundamentally why we're having this debate - there is no consensus on what 0 meant in the past, or now, if it depends on its context. This is why we end up with @pd3 saying that he agrees with your definition 2 yet in the same breath asserting a conflicting definition. It's impossible to avoid representation of uncertainty in VCF since fundamentally what is being described is uncertain - I believe it is far preferable for that uncertainty to be encoded in unique symbols (e.g. * & #, base or symbolic) rather than in a range of values (allele indices) where just one of the values (0) is given special meaning.

dancooke commented 5 years ago

Another data point. The Genome In A Bottle VCFs (v3.3.2) - arguably some of the most important and widely used VCFs publicly available - all specify VCF v4.2 yet do not use the * symbol, indicating that the intended use of 0 was the same as for VCF v4.1 before the * symbol was introduced.

Extract from HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz:

##fileformat=VCFv4.2
...
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG001
1   63493035    .   CGGA    CA,C    50  PASS    platforms=2;platformnames=Illumina,10X;datasets=2;datasetnames=HiSeqPE300x,10XChromium;callsets=2;callsetnames=HiSeqPE300xGATK,10XGATKhaplo;datasetsmissingcall=CGnormal,HiSeqPE300x,IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;filt=CS_SolidPE50x50GATKHC_filt;difficultregion=AllRepeats_lt51bp_gt95identity_merged_slop5   GT:DP:ADALL:AD:GQ:IGT:IPS:PS    2|1:199:0,119,80:0,119,80:198:1/2:.:PATMAT
1   63493038    rs200371077 AG  A   50  PASS    platforms=1;platformnames=Illumina;datasets=1;datasetnames=HiSeqPE300x;callsets=1;callsetnames=HiSeqPE300xGATK;datasetsmissingcall=CGnormal,HiSeqPE300x,10XChromium,IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;filt=CS_SolidPE50x50GATKHC_filt;difficultregion=AllRepeats_lt51bp_gt95identity_merged_slop5    GT:DP:ADALL:AD:GQ:IGT:IPS:PS    0|1:218:99,119:99,119:99:0/1:.:PATMAT
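A check like the one described for the GIAB files can be sketched as a scan over VCF lines that records the declared version and counts records with a `*` ALT. The sample lines below are abbreviated from the extract above, with INFO and FORMAT cut down to keep the example short. `summarize_star_usage` is a hypothetical helper.

```python
def summarize_star_usage(lines):
    """Return (fileformat version, number of records, records with a '*' ALT)."""
    version = None
    total = 0
    star_records = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##fileformat="):
            version = line.split("=", 1)[1]
        elif line and not line.startswith("#"):
            total += 1
            alts = line.split("\t")[4].split(",")  # column 5 is ALT
            if "*" in alts:
                star_records += 1
    return version, total, star_records


sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG001",
    "1\t63493035\t.\tCGGA\tCA,C\t50\tPASS\t.\tGT\t2|1",
    "1\t63493038\trs200371077\tAG\tA\t50\tPASS\t.\tGT\t0|1",
]
print(summarize_star_usage(sample))
```

For a real file, the same function could be fed the lines of an opened (and, if needed, decompressed) VCF.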