google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Performance on short tandem repeats and inversions #180

Closed njbernstein closed 5 years ago

njbernstein commented 5 years ago

I was wondering whether DeepVariant's performance on short tandem repeats and small inversions has been characterized.

I imagine small inversions aren't a problem, but short tandem repeats might be.

Also, how does it represent delins variants (where a stretch of sequence has been deleted and a new sequence has been inserted in its place)?

njbernstein commented 5 years ago

Also, how does the haplotype-aware realignment of reads differ from the first three steps of haplotype caller? https://software.broadinstitute.org/gatk/documentation/article?id=11068

AndrewCarroll commented 5 years ago

Hi @njbernstein

Performance on STRs will be a function of the size of the event. In Illumina data, DeepVariant will likely stop calling events as they approach 100bp in size and larger.

DeepVariant will call STR events below this size. For example, here is a HET call with a repeat expansion in one allele and a repeat contraction in the other, taken from a DeepVariant HG002 WGS VCF:

10 50527727 . CTATATATATATATATATATATATATATATATATATATATA C,CTATATATATATATATATATATATATATATATATATATATATA 37.1 PASS . GT:GQ:DP:AD:VAF:PL 1/2:17:56:2,22,29:0.392857,0.517857:37,20,52,20,0,40
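
If it helps to read that record, the two ALT alleles can be compared to REF by length alone; here is a minimal sketch with plain string parsing (a real pipeline would use a VCF library):

```python
# Minimal sketch: report each ALT allele's length change relative to REF for
# the record above. Plain string parsing only, no VCF library.
record = ("10\t50527727\t.\t"
          "CTATATATATATATATATATATATATATATATATATATATA\t"
          "C,CTATATATATATATATATATATATATATATATATATATATATA\t"
          "37.1\tPASS\t.\tGT:GQ:DP:AD:VAF:PL\t"
          "1/2:17:56:2,22,29:0.392857,0.517857:37,20,52,20,0,40")

fields = record.split("\t")
ref, alts = fields[3], fields[4].split(",")
for i, alt in enumerate(alts, start=1):
    delta = len(alt) - len(ref)
    kind = "expansion" if delta > 0 else "contraction"
    print(f"ALT{i}: {delta:+d} bp ({kind} of the TA repeat)")
```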

I don't have stratified accuracy metrics for STR performance, nor do I have comparisons of this to dedicated STR tools. I would imagine that dedicated STR callers perform better for the long (100bp+) events, due to specific approaches for that class of problem. Below 100bp, I do not have an intuition as to which approach will perform better.

For complex variants, to the extent these are in a size range callable by DeepVariant, DeepVariant will represent the sequence-resolved candidates found for the variation. Here is an example from a DeepVariant HG002 WGS VCF:

1 67310873 . CAAAAAAAAAAAAAAAAAAAGAAAAATTAAA C,CAAAAAAAAAAAAAAAAAAAAAGAAAAATTAAA 45.4 PASS . GT:GQ:DP:AD:VAF:PL 1/2:18:36:2,26,6:0.722222,0.166667:28,4,35,4,0,2

The second ALT allele has insertions of A at multiple places, so that this doesn't cleanly fit into a single contiguous set of inserted or deleted bases. In practice, these complex events will be rare in the size range that DeepVariant is designed to address as a small variant caller.
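
If you want to see exactly where a given ALT allele differs from REF in a record like this, diffing the two sequences is enough. Here is a toy illustration with Python's difflib, purely for inspecting the call (it is not anything DeepVariant does internally):

```python
# Toy illustration: enumerate where the second ALT allele differs from REF
# for the record above. Just a way to inspect the call; DeepVariant does not
# do this internally.
from difflib import SequenceMatcher

ref = "CAAAAAAAAAAAAAAAAAAAGAAAAATTAAA"
alt = "CAAAAAAAAAAAAAAAAAAAAAGAAAAATTAAA"

for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, alt).get_opcodes():
    if tag != "equal":
        print(f"{tag}: REF[{i1}:{i2}]={ref[i1:i2]!r} -> ALT[{j1}:{j2}]={alt[j1:j2]!r}")
```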

Accidentally (because we did not design or train DeepVariant to do so), DeepVariant will call much larger insertion events in PacBio CCS data.

Finally, with respect to your question on haplotype-aware realignment: conceptually, the first two steps are quite similar.

For the first step, identifying which regions to reassemble, DeepVariant employs a relatively simple model to identify regions that will benefit from reassembly. The specific implementation differs from GATK's (and from the linked description, the GATK logic sounds more complex). Benchmarks that reassemble all regions with DeepVariant consistently show the same accuracy as the version with region selection.
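
As a very rough illustration of what I mean by a simple region-selection model (this is not DeepVariant's actual realigner logic, just a hypothetical sketch of the general idea), it amounts to flagging windows where enough reads disagree with the reference:

```python
# Rough, hypothetical illustration of region selection: flag windows where
# enough reads disagree with the reference, then merge adjacent windows.
# This is NOT DeepVariant's actual realigner logic, just the general idea.

def candidate_regions(evidence, window=35, min_support=2):
    """evidence[i] = number of reads disagreeing with the reference at position i.

    Returns half-open (start, end) intervals worth reassembling."""
    flagged = []
    for start in range(0, len(evidence), window):
        win = evidence[start:start + window]
        if any(count >= min_support for count in win):
            flagged.append((start, start + window))
    merged = []
    for s, e in flagged:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

# A small burst of disagreement around positions 40-45 triggers one region.
evidence = [0] * 100
for pos in range(40, 46):
    evidence[pos] = 3
print(candidate_regions(evidence))   # [(35, 70)]
```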

For the second step, the methods are very similar: both construct a de Bruijn graph of reference and alternate contigs. Some of the same authors of the GATK methods are authors of DeepVariant, so apart from the DeepVariant implementation being written in C++ for speed, I expect the two to be conceptually similar.
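
For intuition, the de Bruijn construction both tools use boils down to something like the toy sketch below: k-mers from the reference and reads become edges between (k-1)-mer nodes, and candidate haplotypes are paths through the graph. The real implementation also handles k-mer size selection, pruning, and cycles, none of which is shown here:

```python
# Toy de Bruijn graph over the reference plus reads: each k-mer contributes an
# edge between its two (k-1)-mer ends; candidate haplotypes are graph paths.
from collections import defaultdict

def build_debruijn(sequences, k=4):
    graph = defaultdict(set)   # (k-1)-mer -> set of successor (k-1)-mers
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

ref = "ACGTACGGACGT"
reads = ["ACGTACTGACG", "GTACTGACGT"]   # reads carrying a G>T difference
graph = build_debruijn([ref] + reads, k=4)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```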

For the third step, the methods are entirely different. This is where DeepVariant applies a trained convolutional neural network, looking directly at the raw information across the reads, whereas GATK applies a PairHMM to calculate the likelihood of candidate full haplotypes based on their read support.
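
To make the contrast concrete, the CNN's input is essentially a pileup tensor built over the candidate site. The sketch below is a deliberately simplified two-channel toy version (the real examples carry additional channels such as base quality, mapping quality, and strand), just to show the shape of the data the network looks at:

```python
# Deliberately simplified "pileup tensor" sketch: encode reads over a
# candidate site as a small numeric image for a CNN. The real DeepVariant
# examples carry more channels (base quality, mapping quality, strand, ...);
# here there are only two: base identity and disagreement with the reference.
import numpy as np

BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def pileup_tensor(ref, reads, height=8):
    width = len(ref)
    tensor = np.zeros((height, width, 2), dtype=np.float32)
    for row, read in enumerate(reads[:height]):
        for col, base in enumerate(read[:width]):
            tensor[row, col, 0] = BASE_CODE.get(base, 0.0)          # base identity
            tensor[row, col, 1] = 1.0 if base != ref[col] else 0.0  # differs from ref
    return tensor

ref = "ACGTACGT"
reads = ["ACGTACGT", "ACGAACGT", "ACGAACGT"]   # two reads support a T>A mismatch
print(pileup_tensor(ref, reads).shape)          # (8, 8, 2)
```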

njbernstein commented 5 years ago

@AndrewCarroll Thanks for the thorough response. This was extremely helpful.

Also, re: the haplotype caller comparison, I thought DeepVariant also used a PairHMM to score the haplotypes that reads are re-aligned to, and then fed that to the CNN.

From the paper:

The likelihood function used to score haplotypes is a traditional pair HMM with fixed parameters that do not depend on base quality scores. This likelihood function assumes that each read is independent. Finally, each read is then realigned to its most likely haplotype using a Smith–Waterman-like algorithm with an additional affine gap penalty score for homopolymer indels.
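
For concreteness, my reading of that passage is a fixed-parameter pair HMM along these lines (toy parameters and a toy sketch of my own, not code from DeepVariant or GATK):

```python
# Toy fixed-parameter pair HMM (forward algorithm) in the spirit of the quoted
# passage: score P(read | haplotype) by summing over alignments with match,
# insertion, and deletion states. Parameters are arbitrary; this is my own
# sketch, not code from DeepVariant or GATK.
import numpy as np

def pair_hmm_loglik(read, hap, p_match=0.99, p_gap_open=0.005,
                    p_gap_extend=0.1, p_base_correct=0.999):
    n, m = len(read), len(hap)
    M = np.full((n + 1, m + 1), -np.inf)   # alignment ends in a match/mismatch
    I = np.full((n + 1, m + 1), -np.inf)   # ends in an insertion in the read
    D = np.full((n + 1, m + 1), -np.inf)   # ends in a gap in the read
    M[0, :] = 0.0                          # the read may start anywhere on the haplotype
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            emit = np.log(p_base_correct if read[i - 1] == hap[j - 1]
                          else (1 - p_base_correct) / 3)
            M[i, j] = emit + np.logaddexp.reduce([
                np.log(p_match) + M[i - 1, j - 1],
                np.log(1 - p_gap_extend) + I[i - 1, j - 1],
                np.log(1 - p_gap_extend) + D[i - 1, j - 1]])
            I[i, j] = np.logaddexp(np.log(p_gap_open) + M[i - 1, j],
                                   np.log(p_gap_extend) + I[i - 1, j])
            D[i, j] = np.logaddexp(np.log(p_gap_open) + M[i, j - 1],
                                   np.log(p_gap_extend) + D[i, j - 1])
    # ... and end anywhere on the haplotype.
    return np.logaddexp.reduce(np.concatenate([M[n, 1:], I[n, 1:]]))

hap_ref = "ACGTACGTAC"
hap_alt = "ACGAACGTAC"             # carries a T>A difference
read = "CGAACG"                    # matches the alt haplotype exactly
print(pair_hmm_loglik(read, hap_ref), pair_hmm_loglik(read, hap_alt))
# The alt haplotype scores higher here, so the read would be realigned to it.
```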

So both methods use a PairHMM to score haplotypes and to assist in re-aligning reads, ya? After that, the similarity between the methods is nil, as you mentioned.

Sorry if that was implicit in your response, but wanted to double check I understand.

AndrewCarroll commented 5 years ago

Sorry, there is a misunderstanding here. Only GATK uses a PairHMM to score haplotypes based on probability. DeepVariant does not use a PairHMM at all; instead it uses a convolutional neural network. GATK's PairHMM is used in the 3rd step of the linked document.

The two steps that are (conceptually) shared are identifying regions to apply reassembly to and assembling a de Bruijn graph of the reads.

njbernstein commented 5 years ago

So is the DeepVariant paper out of date then? Because it states that a pair HMM is used, or maybe I'm completely misunderstanding the point.

AndrewCarroll commented 5 years ago

The code for the current, released version of DeepVariant does not use PairHMM to score haplotypes. Since the submission of the DeepVariant manuscript, there have been 4 releases which have improved various aspects of the code, training regime, and training data for models.

The DeepVariant paper does validly describe the methods used in a working version, covering both the original PrecisionFDA submission and the improvements made for the first open source release (v0.4). However, there are further improvements which are not captured in that publication, and which are instead described either in other joint publications (e.g. https://www.biorxiv.org/content/10.1101/519025v2) or in blog posts produced by our team or close partners (e.g. https://ai.googleblog.com/2018/04/deepvariant-accuracy-improvements-for.html and https://medium.com/tensorflow/the-power-of-building-on-an-accelerating-platform-how-deepvariant-uses-intels-avx-optimizations-c8f0acb62344).

njbernstein commented 5 years ago

Ah great. Thanks very much. I was just getting tripped up by the paper.

This was a great explanation. Thank you very much for your thoughtful responses.