luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

De novo mutation in cohort #110

ghost closed 4 years ago

ghost commented 4 years ago

Hello,

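This is not a bug report, just a question about Octopus's capabilities. Let's say I have a cohort of asexual organisms, with a mother and n descendant populations. I have 2 questions:

1. Does octopus make any assumption regarding the reproductive mode? Will it be confused that all alleles are transmitted clonally to the descendants?

2. If I am interested in private variants, i.e. variants seen in only a single descendant, is the joint calling mode appropriate at all? One could assume that Octopus will be biased towards shared alleles. Hence, it would be more appropriate to perform calling on each single population independently and then cross compare them, without any joint calling.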

Thank you

dancooke commented 4 years ago

Let's say I have a cohort of asexual organisms, with a mother and n descendant populations

Do you really mean "n descendant populations" or "n descendant individuals"? In other words, do you know the ploidy of all samples or just the mother?

Does octopus make any assumption regarding the reproductive mode?

By default, the population calling model assumes a population that satisfies the assumptions of Hardy-Weinberg equilibrium (HWE), other than diploidy.

Will it be confused that all alleles are transmitted clonally to the descendants?

Well, this violates the HWE assumptions, but it won't "confuse" the model; the model still allows any combination of genotypes. It's just that the prior is not as informative as it could be. How this ultimately affects genotyping accuracy will largely depend on your data.

If I am interested in private variants, i.e. variants seen in only a single descendant, is the joint calling mode appropriate at all? One could assume that Octopus will be biased towards shared alleles. Hence, it would be more appropriate to perform calling on each single population independently and then cross compare them, without any joint calling.

You can do that using the population calling model; just add the option --use-independent-genotype-priors. That way you at least get consistent variant reporting.

This probably isn't the best thing to do though; the default HWE population model prior may not be 'correct', but it's still likely better than assuming an independence prior. It's a fairly weak prior and is unlikely to result in missed calls unless your population is very large. A quick back-of-the-envelope calculation: assume you have s diploid samples and two alleles at a position, with the variant allele present as a single copy among the 2*s chromosomes. The HWE prior then assigns Phred-scaled probability -10 * log10(2 * 1/(2*s) * (2*s-1)/(2*s)) to the heterozygous genotype. When s=1000 this is about 30, which could easily be offset by one read-base observation. Of course you also have the segregation prior, which usually offsets a second good read-base observation, so you're looking for maybe 3 good supporting reads for the variant in your sample with the other 999 all strongly supporting the reference.
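To make the arithmetic above easy to reproduce for other cohort sizes, here is a minimal Python sketch of the same formula. The function name hwe_het_phred_prior is hypothetical and only evaluates the expression quoted in this comment; it is not Octopus's actual prior implementation, and it ignores the segregation prior.

```python
import math


def hwe_het_phred_prior(s: int) -> float:
    """Phred-scaled HWE prior of the heterozygous genotype.

    Assumes s diploid samples, two alleles at the position, and a single
    copy of the variant allele among the 2*s chromosomes.
    """
    n = 2 * s            # total number of chromosomes in the cohort
    p = 1 / n            # variant allele frequency: 1/(2*s)
    q = (n - 1) / n      # reference allele frequency: (2*s-1)/(2*s)
    return -10 * math.log10(2 * p * q)


if __name__ == "__main__":
    # Print the penalty for a few cohort sizes; s=1000 gives roughly 30.
    for s in (10, 100, 1000):
        print(s, round(hwe_het_phred_prior(s), 2))
```

The penalty grows by about 10 Phred units for every tenfold increase in cohort size, which is why the prior only starts to matter for very large populations.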

In summary, Octopus doesn't have a prior that is an ideal match for your experimental design, but the default population calling model is likely your best bet. I'd be more concerned about false positives than false negatives.

ghost commented 4 years ago

Hello, thanks for the detailed explanation. I really did mean "population", since each sample is a clonal population. However, all the individuals in a population should be strictly identical at the genetic level. It's a microorganism, so I have no choice but to work on populations (a single individual does not contain enough DNA for sequencing without PCR).

It is not easy to determine if joint calling is the way to go or not. I reasoned that going with individual calls and then cross comparing might strongly enrich false positives. On the other hand, joint calling introduces assumptions that are not met either.
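For reference, the cross-comparison step could be sketched roughly as follows, whichever calling mode produces the per-sample calls. This is only an illustration: the function private_variants, the sample names, and the (chrom, pos, ref, alt) variant keys are all assumptions, and the call sets are presumed to have already been extracted from the VCF output.

```python
from collections import Counter

# Hypothetical variant key: (chrom, pos, ref, alt)
Variant = tuple[str, int, str, str]


def private_variants(
    mother: set[Variant],
    descendants: dict[str, set[Variant]],
) -> dict[str, set[Variant]]:
    """Return, per descendant, variants called in that descendant only.

    A variant counts as private if it is absent from the mother and
    present in exactly one descendant call set.
    """
    # Count in how many descendant call sets each variant appears.
    counts = Counter(v for calls in descendants.values() for v in calls)
    return {
        name: {v for v in calls if counts[v] == 1 and v not in mother}
        for name, calls in descendants.items()
    }


if __name__ == "__main__":
    mother = {("chr1", 100, "A", "T")}
    descendants = {
        "d1": {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C")},
        "d2": {("chr1", 100, "A", "T"), ("chr1", 300, "T", "G")},
        "d3": {("chr1", 300, "T", "G")},
    }
    print(private_variants(mother, descendants))
```

Note that a comparison like this treats "not called" as "absent", so a variant missed in the mother or in a sibling population would show up as a false private variant.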