@GuillaumeHolley, can you please try with `--snp_p_value` set to 0.0? If you are using the docker, it can be invoked with `--pepper_snp_p_value`.
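For reference, a minimal sketch of what I mean is below; the paths, thread count, image tag, and model preset are placeholders for your own setup.

```bash
# Sketch only: everything except --pepper_snp_p_value is a placeholder for
# whatever your existing invocation already uses.
docker run -v "$(pwd)":/data kishwars/pepper_deepvariant:r0.8 \
  run_pepper_margin_deepvariant call_variant \
  -b /data/reads.bam \
  -f /data/ref.fasta \
  -o /data/output \
  -t 16 \
  --ont_r9_guppy5_sup \
  --pepper_snp_p_value 0
```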
@kishwarshafin Thanks, I tried that parameter and I can now see it as a reference call with a GQ of 1. Using a value of 0.02 doesn't work, which means this variant has no predictive value at all. I assume "predictive value" is model dependent, i.e., a model trained differently (on different data) would give the variant a higher predictive value?
I am asking because I am using the trained models on regions that are notoriously problematic for some in-house samples. We have a small set of variants we are confident in for these regions, and I am comparing the PEPPER calls against them. I am noticing many more false negatives than false positives in the PEPPER calls. If I look at the PEPPER-SNP calls, most false negatives are not picked up by PEPPER-SNP, probably because they have no predictive value. Hence my question: do you think training a model specifically for these regions, using the small set of variants we are confident in, would help improve the calls there?
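(As a side note, the comparison boils down to a VCF intersection; a minimal sketch, assuming bcftools and placeholder file names:)

```bash
# Rough sketch: intersect PEPPER calls with a small truth set, restricted to
# the regions of interest. File names are placeholders.
bcftools isec -p isec_out -R regions.bed truth.vcf.gz pepper_calls.vcf.gz
# isec_out/0000.vcf : private to truth.vcf.gz        -> false negatives
# isec_out/0001.vcf : private to pepper_calls.vcf.gz -> false positives
# isec_out/0002.vcf + 0003.vcf : shared records      -> true positives
```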
@GuillaumeHolley,
I'll try to give the best answer I can; please let me know if any of this is unclear.
Having a low predictive value can mean a lot of things. The way I interpret it is in terms of how the data is presented to the model. The machine learning models try to find a differentiable signal between TPs and FPs given a set of features. In these difficult regions, if the error rate of nanopore is so high that the summaries PEPPER generates make it very difficult to differentiate between a TP and an FP, then the model will try to do what's best and balance it. In that case, I'd say you'd see a lot of FP calls that look similar to the pileup you gave. If you want to increase sensitivity in these regions, you'll have to play around with the `-p` parameter of `make_train_images`.
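Roughly something like the sketch below; note that apart from `-p`, the sub-command name and flags here are just placeholders for however you already invoke `make_train_images`.

```bash
# Hypothetical sketch: only -p is taken from the discussion above; the
# sub-command and the other options stand in for your existing training
# image generation command.
pepper_variant make_train_images \
  -b reads.bam \
  -f ref.fasta \
  -tv truth.vcf.gz \
  -o train_images/ \
  -p 0.0   # a lower p-value threshold keeps more candidates in the training images
```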
On the other hand, DeepVariant uses a richer set of features. So one way to get better results in these regions is to set PEPPER to its most sensitive mode with `--snp_p_value 0.0` and then train DeepVariant to see if it can resolve these cases. My guess is that you'll get much better results by re-training DeepVariant rather than just training PEPPER for higher sensitivity.
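To make the DeepVariant part concrete, the standard DeepVariant training-example step looks roughly like the sketch below; the paths, image tag, and region files are placeholders, and this omits the candidate-importer options the PEPPER pipeline normally adds.

```bash
# Rough sketch of generating DeepVariant training examples against a small
# truth set. All paths and the image tag are placeholders.
docker run -v "$(pwd)":/data google/deepvariant:1.4.0 \
  /opt/deepvariant/bin/make_examples \
  --mode training \
  --ref /data/ref.fasta \
  --reads /data/reads.bam \
  --truth_variants /data/truth.vcf.gz \
  --confident_regions /data/confident_regions.bed \
  --examples /data/training.examples.tfrecord.gz
```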
Please let me know if the answer is not clear to you and I'll try to clarify further.
Hi @kishwarshafin,
Thank you for the clear explanation and your input on the matter, it does make sense to me. Closing now.
Guillaume
Hi hi,
I have a question about the following example:
In this example, there are 2 SNP candidates. I would have expected PEPPER SNP to report both of them as candidates, but only the one on the left is reported (with FILTER=PASS). The one on the right is not reported, not even as a RefCall. My PEPPER SNP parameters are the following:
It seems to me that the second candidate, on the right, passes all these thresholds (the 4 bases supporting the alt have base quality above 30, and all reads have MAPQ 60). I was wondering if you had an idea of why the variant doesn't make it as a candidate?
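(For reference, the base and mapping qualities above can be double-checked with something like the following; the coordinates and file names are placeholders.)

```bash
# Sketch: print per-read base qualities and mapping qualities at the
# candidate position. Coordinates and file names are placeholders.
samtools mpileup -r chr1:123456-123456 -Q 0 -q 0 --output-MQ \
  -f ref.fasta reads.bam
```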
Thanks, Guillaume
PS: I am using a custom-made model for PEPPER SNP.