samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Allow partially spanned alleles. #1413

Open tfenne opened 5 years ago

tfenne commented 5 years ago

Description

@lbergelson this is likely to be a slightly contentious change despite the small PR. The goal here is to support partially spanned alleles in the form they are used in octopus. See https://github.com/luntergroup/octopus/issues/75 for a description of what's going on, but the short version is that Octopus outputs alleles like *AT and **GCGA to represent cases where the first or first two (respectively) bases of the allele are spanned by an upstream event.

This isn't prohibited by the VCF spec, although it seems like it's an unintentional thing that alleles like this are spec compliant. Still, I'd like to be able to use HTSJDK's VCF API to read octopus VCFs and it currently blows up on records with alleles like this.

codecov-io commented 5 years ago

Codecov Report

Merging #1413 into master will increase coverage by 0.006%. The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #1413       +/-   ##
===============================================
+ Coverage     68.115%   68.121%   +0.006%     
- Complexity      8375      8376        +1     
===============================================
  Files            573       573               
  Lines          33978     33978               
  Branches        5668      5668               
===============================================
+ Hits           23144     23146        +2     
+ Misses          8644      8643        -1     
+ Partials        2190      2189        -1
Impacted Files Coverage Δ Complexity Δ
...ain/java/htsjdk/variant/variantcontext/Allele.java 77.703% <ø> (ø) 95 <0> (ø) :arrow_down:
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java 90.476% <0%> (+9.524%) 4% <0%> (+1%) :arrow_up:
lbergelson commented 5 years ago

Eh... I'm not a fan of this. I think we're going to make it explicitly against spec unless someone can make a REALLY compelling case.

Do you think this representation is a useful one, or do you just want to be able to read octopus files? Would the octopus authors be opposed to changing their representation in future versions?

I think it could make sense to extend the definition of * to be "spanning allele" rather than merely spanning deletion which would work for their second use case. It's not clear to me why the first case can't be represented as a complex event rather than a deletion followed by a partially deleted allele, or just a deletion followed by a change a few bases later so it's not partially deleted.

Anyway.. that is probably discussion for the specs issue instead of here. I really don't like making this wok by default because I don't think we want to encourage it. If you wanted to add it as a weird mode that could be controlled by one of the flags in Defaults I think I'd be happier with that. How would that work for you?

tfenne commented 5 years ago

Yeah, I thought you might have a reaction like that @lbergelson. My primary concern at this point is that I just want to be able to read Octopus VCFs without resorting to silly things like e.g. modifying the input stream on the fly on the way to the VCF parsing code!

@dancooke can give a better answer on what he, as the author of Octopus, thinks about the long term usage of this style, but my sense is that he is committed to outputting VCFs with a single ALT per record, and using phasing and spanned alleles to stitch things together.

Bigger picture, it looks like this was first brought up on htspecs back in June 2019 (https://github.com/samtools/hts-specs/issues/151) and again more recently by me about a week and a half ago (https://github.com/samtools/hts-specs/issues/437). I tend to think, actually, that Dan's use of partially spanned alleles in Octopus is quite nice and once you understand it, I like it better than using just * when the allele is only partially spanned, which I have seen the GATK do. Either way though, I would just like the spec to modified to either explicitly embrace this usage or disallow it so that we can all get back to more fruitful things!

In the short term I don't have any serious issues with having to specifically enable this behaviour via a -Dallow_partially_spanned_alleles or similar if that would get a :+1: from you.

dancooke commented 5 years ago

I think the best case that I can make is that it provides a neater definition that is more in line with the rest of the specification. The working definition that I've used for * is something like

This base is explained by an upstream (or matching `POS`) record.

While the meaning currently being used by GATK appears to be more like

Some or all of the leading reference bases in this allele are deleted - as explained by an upstream (or matching `POS`) record.

So suddenly we go from defining ALT alleles in terms of nucleotides to defining them in terms of both nucleotides and symbolic alleles (but distinct from the other symbolic alleles the spec permits). I'm not suggesting both definitions can't result in semantically equivalent information (w.r.t describing haplotypes), but I think the former is both clearer and more expressive.

A more practical benefit of the base-* definition is that reconstructing haplotypes from VCF is a little easier as there's no need to keep track of information from previous records (other than the sequence), since each record specifies exactly how many reference bases must be ignored.

jmarshall commented 5 years ago

@dancooke: What do you mean by ‘upstream record’? Another record earlier in the same VCF file?

(Records referring to other records seems problematic if the VCF file later gets filtered…)

dancooke commented 5 years ago

@jmarshall A record with smaller or equal POS (so yes, an earlier record in the same VCF if it's co-ordinate sorted). Problematic how, for reconstructing haplotypes? How is this different to the 'symbolic' interpretation? If a previous record is filtered, the reference is implicitly implied. RTGTools vcfeval (3.10.1) supports partially overlapping alleles with * and I haven't encountered any issues with filtered records.

tfenne commented 5 years ago

@jmarshall The same problem exists either way - if you use either @dancooke's representation or the standard * allele. I.e. if you have a deletion that spans a SNP variant, the GATK-style encoding would be to emit the deletion, and then a SNP with ref=A, ALT=*,T GT=1/2. If the deletion is then filtered but the SNP is not, you still have to make a decision at the SNP about how to interpret the * allele - does the genotype become 2/2 because you only observed allele 2 and now you're inferring that the other haplotype isn't deleted? Or do you make it 0/2 assuming that it's reference since the deletion no longer spans it. It's a general issue with spanning alleles, not specific to any one representation of them.

Most of the work I'm doing these days focusing on either a single sample or a very small number of samples at time where frankly combining events/alleles that overlap or are within a few bp into a single VCF record is preferred. But I understand that for larger cohorts this gets extremely messy and there's a lot of compound events, and emitting more simple records works better, hence the overall desire for spanning alleles.

I tend to agree that @dancooke's representation is easier to interpret once you understand it. I want to give a toy example that makes it clear why this is preferred sometime, and attempt to show how the representations differ.

# GATK representation
chr1  1000  GTATATA G         ... 0|1
chr1  1005  TAGCGA  T,*       ... 1|2

# Octopus representation
chr1  1000  GTATATA G         ... 0|1
chr1  1005  TAGCGA  T,**GCGA  ... 1|2

When reconstructing the two haplotypes in the sample, the latter representing makes it clear that the prior variant spans (deletes) two bases from the second ALT, and then the that allele picks back up with GCGA (i.e. the reference bases). When encountering variants like this with GATK you have to do extra work based on positions and allele lengths to determine that the * allele doesn't indicate that the second ALT is fully spanned by the preceding variant. I have been bitten by this previously where I'm trying to reconstruct the local haplotype across variants, and based on the spec had interpreted * to mean an allele was fully spanned by an upstream variant.

lbergelson commented 5 years ago

@tfenne Sorry, I didn't mean to totally shut this down in htsjdk. I figured you were going to come back with a change that adds a flag in defaults.

I'm happy to accept a pr that adds an opt in for this, I just don't want picard/gatk suddenly falling over in confusing ways when they encounter these things. If we decide to not ban it in the spec we can come back and add better support in a later version.

I don't want to make any decision about it in the spec before I talk with a few people who are on vacation right now. People are trickling back in after labor day.

tfenne commented 5 years ago

@lbergelson no worries - I wasn't taking this as 100% blocked, but rather figured it was useful to try and push the spec discussion forward some because it would be better to have the spec clarified and then just do the right thing here. If I get to the point where I'm blocked before the spec issue resolves I'll take a look at making this opt in. Sorry if I wasn't clear.

yfarjoun commented 5 years ago

I like the idea of having a STAR_IS_A_BASE flag in the Defaults class to enable reading of Octopus vcfs. Concurrently there should be discussion in hts-specs about the base-status of star.

yfarjoun commented 5 years ago

@tfenne this is in your court now to implement a "Defaults" parameter to unlock these "malformed" inputs.