Genotype likelihood normalization for alt1/alt2

anands-repo commented 5 years ago

I was interested in understanding how heterozygous alt cases are dealt with, and came upon the explanation in pileup_image.py, which says that three different images are evaluated to obtain the necessary genotype likelihoods. I tried to scan further to see the code that actually does the evaluation of multiple images to obtain genotype likelihoods, but I am unable to locate the code. Could you please point me to the relevant source code section?

pgrosu commented 5 years ago

There are many moving parts in DeepVariant to understand all that is going on code-wise. So let's start at a easier level. Does the following article help: https://blog.dnanexus.com/2019-02-19-deep-dive-into-deepvariant/

anands-repo commented 5 years ago

Thanks very much for pointing me to the blog. It is extremely well-written and very helpful. However, it raised another question for me.

In the blog, it mentions how multiple alternative alleles are treated at a site - stating that for each alternative allele combination, an image is generated. This means, more than two alleles may be considered at a site.

In the DeepVariant paper (https://www.biorxiv.org/content/biorxiv/early/2018/03/20/092890.full.pdf) it is mentioned, "Candidate haplotypes are generated by traversing the assembly graphs and the top two most likely haplotypes are selected which best explain the read evidence." At the time of reading this, I believed that this meant only the top two alleles at a site are considered for example generation, since top-two haplotypes will enforce having only two alleles at a site.

However, it seems to me, after reading the blog, that the candidate haplotype generation is used only to reinterpret the CIGAR strings, and during pileup read generation, the "top-two candidates" have no relevance and all alternative alleles are considered for example generation. Is this correct?

pichuan commented 5 years ago

Hi @anands-repo , to your original question about how the multiple images are combined, you can find the logic in https://github.com/google/deepvariant/blob/r0.8/deepvariant/postprocess_variants.py Specifically, the function merge_predictions is the one that takes multiple images for multi-allelic cases and merge them. As you can see, the comment in that function also refers to: "See the logic described in the class PileupImageCreator pileup_image.py", which is probably the logic you found.

For your last question, I'll have to read and reply later.

pgrosu commented 5 years ago

That paper is an early version of the code - so some things changed and some didn't. So let's parse this out:

1) In the blog, if you look at Jason's Jupyter notebooks, you'll see him using --norealign_reads which skips the preprocessing step described in the paper. That's why he's getting the T which a is a possible sequencing error.

2) The general realignment heuristic is the following: https://github.com/google/deepvariant/blob/r0.8/deepvariant/realigner/realigner.py#L439-L454

3) The ploidy dictates the most likely haplotypes which is 2 :)

3) The haploytype realignment also been optimized a bit since that paper. Take a look at this section of the code: https://github.com/google/deepvariant/blob/r0.8/deepvariant/realigner/fast_pass_aligner.cc#L117-L166

4) Yes, the realignment is basically used to update the reads' interpretation via the best-representative haplotype, trying to approach what might be closest-truth if the sequencer were relatively perfect. It's basically a way to correct.

5) Variant calling comes later, and is performed via the following, which is called from make_examples.py:

https://github.com/google/deepvariant/blob/r0.8/deepvariant/variant_caller.py https://github.com/google/deepvariant/blob/r0.8/deepvariant/variant_calling.cc

6) The predict(...) in call_variants.py emits the genotype_probabilities, which Jason used to demonstrate the low probability of T as a sequencing error. Just like any machine-learning trained model, if it's not seen the pattern before it will not predict it with a high match - no real magic here :)

7) It might help to read the deepvariant.proto, which you can treat like a spec-design document of the general idea behind DeepVariant. The details doc might help, but it's a bit general.

Hope it helps, ~p

pichuan commented 5 years ago

Overall, I feel like this thread has mentioned a few different things that in my mind are pretty different.

Realigner, ( @pgrosu 's pointer 2. 4. above) , which does a local realignment, which was proven helpful especially for our Indel accuracy. Note that this was turned off by default for PacBio model. This is something that @akolesnikov will be able to answer more questions on.

How candidates are proposed (which seems to be the main question at the initial question) is based on set of thresholds, which can be adjusted with these flags: https://github.com/google/deepvariant/blob/r0.8/deepvariant/make_examples.py#L189-L202 If you change these thresholds in your make_examples during calling, you will be able to get different number of candidates proposed in the make_examples stage. But note that our default release models are trained with examples based on the default thresholds. @pgrosu 's pointer 6 above is the pointer of how these initial set of candidates are proposed before they're passed to the classifier (used in call_variants step). And in fact, if you look at the final output VCF at the end, it should include all the candidates. If the classifier decided a candidate is not actually a variant, it will get a 0/0 genotype and have "RefCall" in its FILTER field.

@anands-repo , I'm not sure if my answer and Paul's answer cover everything you want to ask now. If not, feel free to follow up.

anands-repo commented 5 years ago

@pichuan @pgrosu This improves my understanding substantially.

After reading the paper, my initial impression was that DeepVariant consists of two parts - one part following traditional callers performing local assembly, haplotype detection and candidate allele generation based on the top-two haplotypes, and a second part which is a DNN used to perform candidate filtration. In such a scheme, the second part would be presented a single image per site and it makes a determination among homozygous-ref, heterozygous, and homozygous-alt in an allele-agnostic way - hence it needs a single image per site. In such a caller, 3 allele candidates at a site wouldn't be possible. I was also surprised that such a method could outperform all others, since this approach simply (approximately) replaces the VQSR step in GATK or the RandomForest in Strelka, and a lot would be riding on heuristic-based algorithms for candidate selection.

I understand now that all (reasonable) candidates at a site are evaluated and scored against each other using the DNN, which seems to me to be a much more satisfying approach.

I am still interested in how different predictions are combined into a single variant quality score, since they are not normalized; hence my request for guidance towards the code source that combines multiple predictions into a variant call. I will go through these resources over the next day.

Thanks a lot! Much appreciated.

google / deepvariant

Genotype likelihood normalization for alt1/alt2 #197