christacaggiano closed 7 years ago
Hi Christa,
I assume you have a set of variant calls obtained from a cancer sample without a matched normal, and you are trying to filter out germline mutations. You are correct that the Poisson CDF with N/2 is not used in the final output, where the germline/somatic status is determined. Here is why that is the case and what you can do about it:
1. If you do have matched normals, set difference is the way to go for the most accurate results. Call variants in both samples separately and filter out those that are shared.
2. While the Poisson model is pretty accurate, it is overkill to compute in most cases. As the second reply in https://www.biostars.org/p/65080/ nicely states, it is usually enough to label any mutation with frequency >= 50% as germline (you can pull this down to ~40% to be more strict) and then worry about whether the mutations < 10% are noise/artifacts or true variants. This is still not as good as having matched normals, but it mostly does the job, since you are more interested in low-frequency variants in such cases anyway.
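The frequency rule in item 2 can be sketched in a few lines of Python. This is only an illustration of the thresholds mentioned above, not SiNVICT's actual code: the function name, the label strings, and the "intermediate" bucket for frequencies between the two cutoffs are assumptions made for the example.

```python
def label_by_frequency(alt_reads, depth, germline_cutoff=0.50, low_freq_cutoff=0.10):
    """Label a variant call from its allele frequency alone (hypothetical helper).

    germline_cutoff: frequency at or above which the call is labeled germline
                     (0.50 as suggested above; lower it to ~0.40 to be stricter).
    low_freq_cutoff: frequency below which the call needs a closer look, since it
                     may be noise/artifact or a true low-frequency variant.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    vaf = alt_reads / depth  # variant allele frequency
    if vaf >= germline_cutoff:
        return "germline"
    if vaf < low_freq_cutoff:
        return "low_frequency"
    # Assumption: the thread does not say what to call this range.
    return "intermediate"
```

For example, `label_by_frequency(45, 100)` falls in the intermediate range with the default cutoff, while `label_by_frequency(45, 100, germline_cutoff=0.40)` is labeled germline.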
Best,
Can
Hi! Thank you so much for your response! To be clear, then, the algorithmic implementation in SiNVICT is not exactly what was described in the paper?
Thanks! Christa
True, the idea from item (2) above is used in the current implementation to improve speed. The paper version captures the germline/somatic distinction somewhat better, but again, it really shouldn't matter if you are working with cfDNA. That said, if you really need to use the exact formula from the paper, I could add it and provide an option to switch between the two modes.
Best, Can
Hi Christa, I just added an option to use the Poisson CDF as described in the paper to guess the somatic/germline status, in case you need it.
Note: It is on the 'devel' branch, so make sure you check out that branch. I made it the default option there. If you want to turn it off and use the simpler version we discussed instead, use the "-s 0" option on the command line.
Thanks Can, that is extremely helpful! Thank you for your time and effort.
Hi,
I see that in your paper, the second pass through the Poisson CDF uses a value of N/2. However, I do not see that reflected in your code, so I was wondering how you implemented it. Is this lambda value still multiplied by the average error rate (0.01 for Illumina)?
Thanks! Christa
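To make the question concrete, here is a minimal Python sketch of a Poisson CDF evaluated at the candidate lambda values being asked about. It does not claim to reproduce SiNVICT's code or settle which lambda the paper intends. The function name, the example depth and read count, and the three candidate lambdas are assumptions for illustration only.

```python
import math

def poisson_cdf(k, lam):
    """Return P(X <= k) for X ~ Poisson(lam), summing terms in log space."""
    if k < 0:
        return 0.0
    if lam == 0:
        return 1.0
    log_lam = math.log(lam)
    # Each term is exp(i*log(lam) - lam - log(i!)), which avoids overflow
    # for large lam or large i.
    terms = (math.exp(i * log_lam - lam - math.lgamma(i + 1)) for i in range(k + 1))
    return min(math.fsum(terms), 1.0)

# Hypothetical example values: read depth N, alternate-allele read count k,
# and the 0.01 Illumina error rate mentioned in the question.
N, k, error_rate = 1000, 100, 0.01

candidates = {
    "N * error_rate": N * error_rate,
    "N / 2": N / 2,
    "(N / 2) * error_rate": (N / 2) * error_rate,
}
for name, lam in candidates.items():
    print(f"lambda = {name} = {lam}: P(X <= {k}) = {poisson_cdf(k, lam):.3g}")
```

The three lambdas give very different probabilities for the same read counts, which is why it matters whether N/2 is used directly or scaled by the error rate.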