samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Is the star allele (*) considered symbolic or not? (a discussion about VC types) #151

Open yfarjoun opened 8 years ago

yfarjoun commented 8 years ago

The VCF spec describes symbolic alleles as an angle-bracketed ID string, “<ID>” (in 1.6.1.4), but the overlapping-deletion allele is written *. I suspect that the intention is that the star allele be considered a symbolic allele. Which deletion overlaps the site can depend on the sample/genotype, so * cannot be said to stand for one specific allele whose sequence is simply not spelled out.

In HTSJDK a VariantContext has a "type", as does an Allele. This isn't spelled out in the VCF spec and so I'm not sure if other VCF parsers do this as well (and if they do, whether it is using the same definitions...). The classification seems to be based on this.

Currently, since the star allele isn't considered symbolic, the VariantContext with it is considered a SNP (all the alleles are of length 1). I would like to change that but am concerned that there are issues that I haven't considered.
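To make the concern concrete, here is a minimal Python sketch of a length-based site classifier. It is not HTSJDK's code: `is_symbolic`, `classify_site`, the `star_is_symbolic` flag and the returned labels are all hypothetical simplifications that look only at angle brackets and allele lengths.

```python
def is_symbolic(allele: str, star_is_symbolic: bool = False) -> bool:
    """Treat <ID> alleles as symbolic; optionally treat '*' the same way."""
    if allele.startswith("<") and allele.endswith(">"):
        return True
    return star_is_symbolic and allele == "*"


def classify_site(ref: str, alts: list[str], star_is_symbolic: bool = False) -> str:
    """Hypothetical, simplified classifier (not the HTSJDK implementation)."""
    if any(is_symbolic(a, star_is_symbolic) for a in alts):
        return "HAS_SYMBOLIC"
    # Every allele has length 1, so a bare '*' passes as if it were a base.
    if len(ref) == 1 and all(len(a) == 1 for a in alts):
        return "SNP"
    return "INDEL_OR_OTHER"


print(classify_site("A", ["G", "*"]))                         # SNP
print(classify_site("A", ["G", "*"], star_is_symbolic=True))  # HAS_SYMBOLIC
```

Under these assumptions, the same record is counted as a SNP or not depending only on how * is classified, which is why "how many SNPs does a VCF have?" can differ between tools.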

Since Allele type and Variation type are not specified in the VCF spec (as far as I could see), different implementations are thus free to do what they wish, but I suspect that we should decide as a community how to approach this so that we can agree on the meaning of basic things like "how many SNPs does a VCF have?"

yfarjoun commented 8 years ago

Tumbleweed?

Does no one care, has no one thought about this, or does no one have a strong opinion one way or the other?

I'll put in a spec-change PR and see if that makes more people chime in.

d-cameron commented 8 years ago

There is already a star alternate allele <*> defined in VCFv4.3 section 5.5 which is different from the section 1.6.1.5 * "missing" alt allele. Unfortunately, the wording of 1.6.1.5 seems to indicate that AN*G*GG is a valid alternate allele (and that case insensitivity is explicitly allowed which, if actually implemented, breaks SV symbolic alleles).

I raised an issue with the htsjdk allele class design a while back (see https://github.com/samtools/htsjdk/issues/18). My preference is for an API design that can distinguish between SNVs, SVs, and both star alleles, but I think that is an implementation issue, not a specifications issue.

pd3 commented 8 years ago

The star allele * is not a SNP. For example, if there is a big deletion in one sample and another sample has a SNP within the deleted sequence, the question is how to represent the genotype of the deletion-carrying sample at the SNP site: 0/0 would mean the reference allele, which it is not. One could use the missing genotype ./., but that could also mean that the genotype could not be determined. The star allele allows us to represent this situation.
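A small Python sketch of that situation, for illustration only. The positions, alleles, sample names and genotypes below are made up, and `genotype_alleles` is a hypothetical helper, not part of any VCF library.

```python
# Hypothetical two-sample example: (CHROM, POS, REF, ALT list, {sample: GT}).
records = [
    # S1 is heterozygous for a deletion of positions 101-103.
    ("1", 100, "ACGT", ["A"], {"S1": "0/1", "S2": "0/0"}),
    # S2 has a SNP at 102, inside the deleted sequence; S1's deleted
    # haplotype is reported with the '*' allele (ALT index 2) at this site.
    ("1", 102, "G", ["T", "*"], {"S1": "0/2", "S2": "0/1"}),
]


def genotype_alleles(gt: str, ref: str, alts: list[str]) -> list[str]:
    """Translate a GT string such as '0/2' into allele strings."""
    alleles = [ref] + alts
    indices = gt.replace("|", "/").split("/")
    return [alleles[int(i)] if i != "." else "." for i in indices]


_, _, ref, alts, samples = records[1]
for name, gt in samples.items():
    print(name, genotype_alleles(gt, ref, alts))
# S1 ['G', '*']
# S2 ['G', 'T']
```

Here S1 is neither called reference (0/0) nor left missing (./.) at position 102; the * records that one of its haplotypes is absent there because of the upstream deletion.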

The term "symbolic allele" refers primarily to anything enclosed in angle brackets <>. In a broader sense, the term is often used to describe situations where the sequence of the alternate allele is not or cannot be given explicitly, for example all the SV events, or the "unobserved allele" <*>, which is used as a placeholder to express all genotype likelihoods.
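As a rough illustration of keeping these cases apart in an API, here is a Python sketch. `AltKind` and `alt_kind` are hypothetical names, and the regular expressions are simplifying assumptions rather than the spec's grammar; breakends and the missing value `.` are lumped into `OTHER`.

```python
import re
from enum import Enum


class AltKind(Enum):
    BASES = "bases"
    SPANNING_DELETION = "*"
    UNOBSERVED = "<*>"
    SYMBOLIC = "<ID>"
    OTHER = "other"  # breakends, '.', or strings this sketch does not recognise


def alt_kind(alt: str) -> AltKind:
    """Classify one ALT string (simplified; not the VCF spec's full grammar)."""
    if alt == "*":
        return AltKind.SPANNING_DELETION
    if alt == "<*>":
        return AltKind.UNOBSERVED
    if re.fullmatch(r"<[^<>\s]+>", alt):
        return AltKind.SYMBOLIC
    if re.fullmatch(r"[ACGTNacgtn]+", alt):
        return AltKind.BASES
    return AltKind.OTHER


for alt in ["*", "<*>", "<DEL>", "ACG", "AN*G*GG"]:
    print(alt, alt_kind(alt).name)
```

The order of the checks matters: `<*>` has to be tested before the generic `<ID>` pattern, otherwise it would be reported as an ordinary symbolic allele.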

Strings like AN*G*GG should not be allowed. I don't really know what it'd be good for or how to interpret it.

Lenbok commented 5 years ago

The specification is currently vague about whether the use of * to represent a spanning deletion must use * as the whole allele.

"Options are base (sic) Strings made up of the bases A,C,G,T,N,*, ..." makes it seem like * can be freely mixed with regular bases.

"The ‘*’ allele is reserved to indicate that the allele is missing due to an overlapping deletion." makes it seem like representation of spanning deletion should use * as a whole ALT.
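The two readings can be written down as patterns. This is only a sketch of the ambiguity: `PERMISSIVE` and `STRICT` are hypothetical names, and neither regular expression is taken from the specification.

```python
import re

# Reading 1: '*' is just another character allowed in a base string.
PERMISSIVE = re.compile(r"[ACGTN*]+")
# Reading 2: '*' is only valid as the entire ALT allele.
STRICT = re.compile(r"[ACGTN]+|\*")

for alt in ["*", "ACG", "A*", "AN*G*GG"]:
    print(alt, bool(PERMISSIVE.fullmatch(alt)), bool(STRICT.fullmatch(alt)))
# * True True
# ACG True True
# A* True False
# AN*G*GG True False
```

The readings agree on plain base strings and on a lone *, and differ exactly on alleles that mix * with bases.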

In the case of an insertion or deletion that coincides with the edge of a spanning deletion, the requirement to add an anchor base would mean that either the boundary of the spanning deletion is being implicitly moved, or the anchor base must also be added to the spanning deletion allele.

Similarly in the case of two partially overlapping deletions, you might want to add bases to each spanning deletion allele to indicate where the overlapping deletion stops.

The alternative is to disallow mixing * with other bases, so that use of this allele does not imply that the entire corresponding haplotype has been deleted (although this reduces its utility).

Note that the Octopus variant caller (from @dancooke) uses this "partial spanning deletion" notation currently.