agitter opened this issue 7 years ago
The basic idea is to encode input instances as RGB images. Each channel carries different information about the sequence:

- Red: the base (A, C, G, or T)
- Green: the base's quality score
- Blue: the strand of the read (positive or negative)
The reference is encoded as a row in the image, and the remaining rows encode reads supporting a candidate variant. My working assumption is that they use RGB images in order to leverage all of the existing code and pretrained networks for images. For example, "the CNN was initialized with weights from the imagenet model ConvNetJuly2015v2". Also, the full RGB feature space isn't needed: it looks like the blue channel takes only two possible values, encoding the positive or negative strand.
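To make the encoding concrete, here is a minimal sketch of building such a pileup image in Python. This is not DeepVariant's actual implementation; the `encode_pileup` helper and the specific intensity values are assumptions of mine that follow the channel scheme above:

```python
import numpy as np

# Minimal sketch of the pileup-image encoding described above. This is
# NOT DeepVariant's actual code: the function name and the specific
# intensity values are illustrative assumptions.
BASE_INTENSITY = {"A": 250, "G": 180, "T": 100, "C": 30}  # red channel
POSITIVE_STRAND, NEGATIVE_STRAND = 70, 240                # blue channel

def encode_pileup(reference, reads, height=100):
    """Encode a reference row plus supporting reads as an RGB image.

    reference: string of bases covering the candidate window.
    reads: list of (bases, qualities, is_positive_strand) tuples,
           each aligned to the reference window.
    """
    width = len(reference)
    image = np.zeros((height, width, 3), dtype=np.uint8)

    # First row: the reference sequence, at maximum quality.
    for col, base in enumerate(reference):
        image[0, col] = (BASE_INTENSITY.get(base, 0), 255, POSITIVE_STRAND)

    # Remaining rows: one supporting read per row.
    for row, (bases, quals, positive) in enumerate(reads[: height - 1], start=1):
        strand = POSITIVE_STRAND if positive else NEGATIVE_STRAND
        for col, (base, q) in enumerate(zip(bases, quals)):
            green = min(255, int(q / 40 * 255))  # scale Phred quality into 0-255
            image[row, col] = (BASE_INTENSITY.get(base, 0), green, strand)
    return image

# Toy usage: a two-read pileup over an 8-base window.
img = encode_pileup("ACGTACGT",
                    [("ACGTACGT", [30] * 8, True),
                     ("ACGAACGT", [20] * 8, False)])
print(img.shape)  # (100, 8, 3)
```

Scaling Phred qualities into the green channel and clamping everything to `uint8` is what lets off-the-shelf image code consume the data unchanged.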
Note that the methods section appears in a separate supplement. If you want to spend five minutes understanding the core idea, check out the "Creating images around candidate variants" section, which has the code that implements what I described above.
The robustness of the trained model against different "sequencing depth, preparation protocol, instrument type, genome build, and *even species*" (emphasis mine) is pretty amazing.
The code is in a private alpha and should become publicly available later.
Regarding the tweet, it is easy to overlook the methods section in this paper because it appears in the separate supplement. As I linked above, the code is in a gray area: it's not (yet?) publicly available, but it also isn't completely private like the code for many other papers.
First release of DeepVariant: v0.4.0
Cell Systems published an editorial discussing DeepVariant: https://doi.org/10.1016/j.cels.2017.12.012
The preprint: https://doi.org/10.1101/092890
They encode the input data as an RGB image. I'm not yet sure if that's for convenience with respect to existing code and algorithms or if there are benefits over alternative, more direct encodings.
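If the motivation really is convenience, the payoff looks roughly like the sketch below: an ImageNet-pretrained backbone accepts the three-channel pileup image unchanged, and only the classification head is replaced. InceptionV3 is a stand-in I picked for illustration (the paper initialized from "ConvNetJuly2015v2", an earlier Inception model), and the three output classes correspond to the genotype states hom-ref, het, and hom-alt:

```python
import tensorflow as tf

# Hypothetical transfer-learning setup, not the paper's actual training
# code. An ImageNet-pretrained InceptionV3 (a stand-in for the paper's
# "ConvNetJuly2015v2") is reused as-is because the pileup images are
# ordinary 3-channel RGB tensors.
base = tf.keras.applications.InceptionV3(
    weights="imagenet",          # transfer the ImageNet-trained filters
    include_top=False,           # drop the 1000-class ImageNet head
    input_shape=(299, 299, 3),   # RGB pileup images resized to fit
)
x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
out = tf.keras.layers.Dense(3, activation="softmax")(x)  # hom-ref / het / hom-alt
model = tf.keras.Model(base.input, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Nothing upstream of the final layer knows it is looking at sequencing data, which is exactly the kind of reuse that would make the RGB choice attractive.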