Open tfenne opened 5 years ago
Merging #1413 into master will increase coverage by
0.006%
. The diff coverage isn/a
.
@@ Coverage Diff @@
## master #1413 +/- ##
===============================================
+ Coverage 68.115% 68.121% +0.006%
- Complexity 8375 8376 +1
===============================================
Files 573 573
Lines 33978 33978
Branches 5668 5668
===============================================
+ Hits 23144 23146 +2
+ Misses 8644 8643 -1
+ Partials 2190 2189 -1
Impacted Files | Coverage Δ | Complexity Δ | |
---|---|---|---|
...ain/java/htsjdk/variant/variantcontext/Allele.java | 77.703% <ø> (ø) |
95 <0> (ø) |
:arrow_down: |
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java | 90.476% <0%> (+9.524%) |
4% <0%> (+1%) |
:arrow_up: |
Eh... I'm not a fan of this. I think we're going to make it explicitly against spec unless someone can make a REALLY compelling case.
Do you think this representation is a useful one, or do you just want to be able to read octopus files? Would the octopus authors be opposed to changing their representation in future versions?
I think it could make sense to extend the definition of * to be "spanning allele" rather than merely spanning deletion which would work for their second use case. It's not clear to me why the first case can't be represented as a complex event rather than a deletion followed by a partially deleted allele, or just a deletion followed by a change a few bases later so it's not partially deleted.
Anyway.. that is probably discussion for the specs issue instead of here. I really don't like making this wok by default because I don't think we want to encourage it. If you wanted to add it as a weird mode that could be controlled by one of the flags in Defaults I think I'd be happier with that. How would that work for you?
Yeah, I thought you might have a reaction like that @lbergelson. My primary concern at this point is that I just want to be able to read Octopus VCFs without resorting to silly things like e.g. modifying the input stream on the fly on the way to the VCF parsing code!
@dancooke can give a better answer on what he, as the author of Octopus, thinks about the long term usage of this style, but my sense is that he is committed to outputting VCFs with a single ALT per record, and using phasing and spanned alleles to stitch things together.
Bigger picture, it looks like this was first brought up on htspecs back in June 2019 (https://github.com/samtools/hts-specs/issues/151) and again more recently by me about a week and a half ago (https://github.com/samtools/hts-specs/issues/437). I tend to think, actually, that Dan's use of partially spanned alleles in Octopus is quite nice and once you understand it, I like it better than using just *
when the allele is only partially spanned, which I have seen the GATK do. Either way though, I would just like the spec to modified to either explicitly embrace this usage or disallow it so that we can all get back to more fruitful things!
In the short term I don't have any serious issues with having to specifically enable this behaviour via a -Dallow_partially_spanned_alleles or similar if that would get a :+1: from you.
I think the best case that I can make is that it provides a neater definition that is more in line with the rest of the specification. The working definition that I've used for *
is something like
This base is explained by an upstream (or matching `POS`) record.
While the meaning currently being used by GATK appears to be more like
Some or all of the leading reference bases in this allele are deleted - as explained by an upstream (or matching `POS`) record.
So suddenly we go from defining ALT
alleles in terms of nucleotides to defining them in terms of both nucleotides and symbolic alleles (but distinct from the other symbolic alleles the spec permits). I'm not suggesting both definitions can't result in semantically equivalent information (w.r.t describing haplotypes), but I think the former is both clearer and more expressive.
A more practical benefit of the base-*
definition is that reconstructing haplotypes from VCF is a little easier as there's no need to keep track of information from previous records (other than the sequence), since each record specifies exactly how many reference bases must be ignored.
@dancooke: What do you mean by ‘upstream record’? Another record earlier in the same VCF file?
(Records referring to other records seems problematic if the VCF file later gets filtered…)
@jmarshall A record with smaller or equal POS
(so yes, an earlier record in the same VCF if it's co-ordinate sorted). Problematic how, for reconstructing haplotypes? How is this different to the 'symbolic' interpretation? If a previous record is filtered, the reference is implicitly implied. RTGTools vcfeval (3.10.1) supports partially overlapping alleles with *
and I haven't encountered any issues with filtered records.
@jmarshall The same problem exists either way - if you use either @dancooke's representation or the standard *
allele. I.e. if you have a deletion that spans a SNP variant, the GATK-style encoding would be to emit the deletion, and then a SNP with ref=A, ALT=*,T GT=1/2
. If the deletion is then filtered but the SNP is not, you still have to make a decision at the SNP about how to interpret the *
allele - does the genotype become 2/2
because you only observed allele 2 and now you're inferring that the other haplotype isn't deleted? Or do you make it 0/2
assuming that it's reference since the deletion no longer spans it. It's a general issue with spanning alleles, not specific to any one representation of them.
Most of the work I'm doing these days focusing on either a single sample or a very small number of samples at time where frankly combining events/alleles that overlap or are within a few bp into a single VCF record is preferred. But I understand that for larger cohorts this gets extremely messy and there's a lot of compound events, and emitting more simple records works better, hence the overall desire for spanning alleles.
I tend to agree that @dancooke's representation is easier to interpret once you understand it. I want to give a toy example that makes it clear why this is preferred sometime, and attempt to show how the representations differ.
# GATK representation
chr1 1000 GTATATA G ... 0|1
chr1 1005 TAGCGA T,* ... 1|2
# Octopus representation
chr1 1000 GTATATA G ... 0|1
chr1 1005 TAGCGA T,**GCGA ... 1|2
When reconstructing the two haplotypes in the sample, the latter representing makes it clear that the prior variant spans (deletes) two bases from the second ALT, and then the that allele picks back up with GCGA
(i.e. the reference bases). When encountering variants like this with GATK you have to do extra work based on positions and allele lengths to determine that the *
allele doesn't indicate that the second ALT is fully spanned by the preceding variant. I have been bitten by this previously where I'm trying to reconstruct the local haplotype across variants, and based on the spec had interpreted *
to mean an allele was fully spanned by an upstream variant.
@tfenne Sorry, I didn't mean to totally shut this down in htsjdk. I figured you were going to come back with a change that adds a flag in defaults.
I'm happy to accept a pr that adds an opt in for this, I just don't want picard/gatk suddenly falling over in confusing ways when they encounter these things. If we decide to not ban it in the spec we can come back and add better support in a later version.
I don't want to make any decision about it in the spec before I talk with a few people who are on vacation right now. People are trickling back in after labor day.
@lbergelson no worries - I wasn't taking this as 100% blocked, but rather figured it was useful to try and push the spec discussion forward some because it would be better to have the spec clarified and then just do the right thing here. If I get to the point where I'm blocked before the spec issue resolves I'll take a look at making this opt in. Sorry if I wasn't clear.
I like the idea of having a STAR_IS_A_BASE flag in the Defaults class to enable reading of Octopus vcfs. Concurrently there should be discussion in hts-specs about the base-status of star.
@tfenne this is in your court now to implement a "Defaults" parameter to unlock these "malformed" inputs.
Description
@lbergelson this is likely to be a slightly contentious change despite the small PR. The goal here is to support partially spanned alleles in the form they are used in octopus. See https://github.com/luntergroup/octopus/issues/75 for a description of what's going on, but the short version is that Octopus outputs alleles like
*AT
and**GCGA
to represent cases where the first or first two (respectively) bases of the allele are spanned by an upstream event.This isn't prohibited by the VCF spec, although it seems like it's an unintentional thing that alleles like this are spec compliant. Still, I'd like to be able to use HTSJDK's VCF API to read octopus VCFs and it currently blows up on records with alleles like this.