Creating a universal SNP and small indel variant caller with deep neural networks

agitter commented 7 years ago

Next-generation sequencing (NGS) is a rapidly evolving set of technologies that can be used to determine the sequence of an individual's genome by calling genetic variants present in an individual using billions of short, errorful sequence reads. Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships (likelihoods) between images of read pileups around putative variant sites and ground-truth genotype calls. This approach, called DeepVariant, outperforms existing tools, even winning the "highest performance" award for SNPs in a FDA-administered variant calling challenge. The learned model generalizes across genome builds and even to other species, allowing non-human sequencing projects to benefit from the wealth of human ground truth data. We further show that, unlike existing tools which perform well on only a specific technology, DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, from deep whole genomes from 10X Genomics to Ion Ampliseq exomes. DeepVariant represents a significant step from expert-driven statistical modeling towards more automatic deep learning approaches for developing software to interpret biological instrumentation data.

They encode the input data as an RGB image. I'm not yet sure if that's for convenience with respect to existing code and algorithms or if there are benefits over alternative, more direct, encodings.

agitter commented 7 years ago

The basic idea is to encode input instances as RGB images. Each channel carries different information about the sequence:

Red: base (A, C, G, T)
Green: read quality
Blue: strand

The reference is encoded as one row of the image, and the remaining rows encode the reads supporting a candidate variant. My working assumption is that they use RGB images in order to leverage the existing code and trained networks for images. For example, "the CNN was initialized with weights from the imagenet model ConvNetJuly2015v2". Also, the full RGB feature space isn't needed. It looks like the blue channel takes only two possible values, which encode the positive or negative strand.
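To make the encoding concrete, here is a minimal sketch of how a pileup could be turned into rows of RGB pixels. It is my own illustration and not the authors' code: the function names, the per-base red values, the quality cap of 40, and the choice to give the reference row maximum quality and positive strand are all assumptions.

```python
# Hypothetical channel values for illustration only; the supplement defines
# its own scaling, which may differ from these numbers.
BASE_TO_RED = {"A": 250, "G": 180, "T": 100, "C": 30}
MAX_QUALITY = 40  # assumed cap on the Phred-scaled base quality


def encode_pixel(base, quality, on_positive_strand):
    """Return an (R, G, B) tuple for one base of one read."""
    red = BASE_TO_RED.get(base, 0)  # unknown bases such as N map to 0
    green = int(255 * min(quality, MAX_QUALITY) / MAX_QUALITY)
    blue = 255 if on_positive_strand else 0  # only two possible values
    return (red, green, blue)


def encode_pileup(reference, reads):
    """Build the image as a list of pixel rows.

    The first row is the reference; each later row is one read. `reads` is a
    list of (bases, qualities, on_positive_strand) tuples, where each read is
    assumed to be already aligned to, and the same length as, `reference`.
    """
    # Assumption: the reference row gets maximum quality and positive strand.
    image = [[encode_pixel(b, MAX_QUALITY, True) for b in reference]]
    for bases, qualities, on_positive_strand in reads:
        row = [
            encode_pixel(b, q, on_positive_strand)
            for b, q in zip(bases, qualities)
        ]
        image.append(row)
    return image


if __name__ == "__main__":
    reads = [
        ("ACGT", [40, 40, 20, 10], True),
        ("ACTT", [30, 30, 30, 30], False),  # carries a G>T mismatch
    ]
    for row in encode_pileup("ACGT", reads):
        print(row)
```

In this sketch the blue channel only ever holds 0 or 255, which illustrates why most of the RGB feature space goes unused.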

Note that the methods section appears in a separate supplement. If you want to spend 5 minutes to understand the core idea, check out the "Creating images around candidate variants" section, which has the code that implements what I described above.

The robustness of the trained model against different "sequencing depth, preparation protocol, instrument type, genome build, and even species" (emphasis mine) is pretty amazing.

The code is in a private alpha and should become publicly available later.

gwaybio commented 7 years ago

Reference from #150 above discusses this tweet

agitter commented 7 years ago

Regarding the tweet, it is easy to overlook the methods section in this paper because it appears in the separate supplement. As I linked above, the code is in a gray area. It's not (yet?) publicly available, but it also isn't completely private, as the code for many other papers is.

gwaybio commented 6 years ago

First release of DeepVariant=0.4.0

agitter commented 6 years ago

Cell Systems had an editorial that discussed DeepVariant: https://doi.org/10.1016/j.cels.2017.12.012

greenelab / deep-review

Creating a universal SNP and small indel variant caller with deep neural networks #159