luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Joint calling is not fully supported #12

Closed dancooke closed 5 years ago

dancooke commented 6 years ago

Currently, the population calling model uses an independence-based model, so samples are not jointly genotyped. Whilst this is still better than individually calling all samples and merging calls, as octopus can borrow haplotypes generated from other samples and will at least output consistent calls, it is not optimal because this approach doesn't leverage genotype prior information across samples. Specifically, PopulationModel found in src/core/models/genotype/population_model.hpp needs implementing.
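To make the difference concrete, here is a minimal toy sketch of what "leveraging genotype prior information across samples" can mean. It is not the octopus PopulationModel: it assumes a single biallelic site, diploid samples, a Hardy-Weinberg genotype prior, and a simple EM loop over the alt allele frequency. All function names and the example likelihoods are hypothetical.

```python
from typing import List, Sequence, Tuple


def hardy_weinberg_prior(alt_freq: float) -> List[float]:
    """Prior over diploid genotypes carrying 0, 1 or 2 alt alleles."""
    ref_freq = 1.0 - alt_freq
    return [ref_freq * ref_freq, 2 * ref_freq * alt_freq, alt_freq * alt_freq]


def posteriors(likelihoods: Sequence[Sequence[float]],
               prior: Sequence[float]) -> List[List[float]]:
    """Genotype posteriors for each sample under one shared prior."""
    result = []
    for sample_likelihoods in likelihoods:
        unnormalised = [l * p for l, p in zip(sample_likelihoods, prior)]
        total = sum(unnormalised)
        result.append([u / total for u in unnormalised])
    return result


def joint_posteriors(likelihoods: Sequence[Sequence[float]],
                     initial_alt_freq: float = 0.01,
                     iterations: int = 50) -> Tuple[List[List[float]], float]:
    """EM: alternate genotype posteriors and the allele frequency they imply."""
    alt_freq = initial_alt_freq
    for _ in range(iterations):
        post = posteriors(likelihoods, hardy_weinberg_prior(alt_freq))
        # Expected number of alt alleles across all samples.
        expected_alt = sum(p[1] + 2 * p[2] for p in post)
        alt_freq = expected_alt / (2 * len(likelihoods))
    return posteriors(likelihoods, hardy_weinberg_prior(alt_freq)), alt_freq


# Hypothetical genotype likelihoods P(reads | genotype) for three samples:
# two look clearly heterozygous, the third is ambiguous.
likelihoods = [
    [0.001, 0.9, 0.001],
    [0.001, 0.9, 0.001],
    [0.4, 0.5, 0.01],
]

# Independence-style calling: every sample sees the same fixed prior.
indep_post = posteriors(likelihoods, hardy_weinberg_prior(0.01))
# Joint-style calling: the prior is informed by all samples together.
joint_post, estimated_alt_freq = joint_posteriors(likelihoods)

print("independent het posterior, sample 3:", indep_post[2][1])
print("joint het posterior, sample 3:", joint_post[2][1])
```

In this toy setup the ambiguous third sample gets a higher heterozygous posterior under the joint prior, because the other two samples raise the estimated allele frequency.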

In addition, when more than a few samples are used for joint calling, the number of candidate variants can become intractably large. A better strategy would be to individually call each sample using a low variant posterior threshold, and then re-call all the samples together using only the called variants. This can easily be done manually, but it would be nice to automate the process.
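The candidate-selection step of this two-pass strategy can be sketched as follows. This is only an illustration of the idea, not octopus code: the variant tuple, the posterior-as-probability scale, and the function name are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple

# Toy variant key: (contig, position, ref, alt). Hypothetical representation.
Variant = Tuple[str, int, str, str]


def first_pass_candidates(per_sample_calls: Dict[str, Dict[Variant, float]],
                          min_posterior: float) -> Set[Variant]:
    """Union of variants passing a deliberately low posterior threshold in any sample."""
    candidates: Set[Variant] = set()
    for calls in per_sample_calls.values():
        candidates.update(v for v, posterior in calls.items()
                          if posterior >= min_posterior)
    return candidates


# Hypothetical first-pass output: variant posteriors from calling each sample alone.
per_sample_calls = {
    "sampleA": {("chr1", 100, "A", "T"): 0.99, ("chr1", 250, "G", "C"): 0.02},
    "sampleB": {("chr1", 100, "A", "T"): 0.40, ("chr1", 300, "C", "CA"): 0.15},
}

# The second pass would re-call all samples together, considering only these variants.
candidates = first_pass_candidates(per_sample_calls, min_posterior=0.1)
print(sorted(candidates))
```

The threshold is kept low on purpose: a variant that is weakly supported in one sample may still be confirmed once all samples are considered together, so the first pass should favour sensitivity over precision.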

dancooke commented 6 years ago

As of 67003ea joint calling is supported for samples with the same ploidy across all contigs - we still can't joint call samples with different ploidy.

dancooke commented 5 years ago

Full support as of 53b93421b3d2bf2f125eb9043575a60622708dc1.

jblamyatifremer commented 4 years ago

Dear Dan,

It seems that joint calling only works with the "population" model...

We are working with the "cancer" or "polyclone" models, and we suspect that some haplotypes are shared between samples.

Do you plan to implement joint analysis? Would you recommend doing the joint analysis by hand, with two passes over the samples and an intermediate "trusted VCF"?

Cheers, JB

dancooke commented 4 years ago

@jblamyatifremer The cancer calling model supports joint calling of multiple tumour samples from the same individual. I'm not currently planning to extend this model to allow joint calling of multiple individuals as the model complexity would be significant and the benefit small.

I can see a stronger use-case for having joint calling in the polyclone model, but I can also imagine several possible experimental designs that lead to different modelling assumptions. What exactly is your experimental design?

jblamyatifremer commented 4 years ago

Your tool is quite impressive in terms of accuracy, even in hard designs such as tumor-only cancer calling!

About cancer: a growing number of "transmissible cancers" are being characterized in various taxa (https://en.wikipedia.org/wiki/Clonally_transmissible_cancer and Ostrander EA, et al. (2016). "Transmissible tumors: breaking the cancer paradigm". Trends in Genetics. 32 (1): 1–15. doi:10.1016/j.tig.2015.10.001. ISSN 0168-9525. PMC 4698198. PMID 26686413). In other words, neoplastic cells from individual 1 can grow in individual 2 and develop into a cancer (scary).

Our experimental designs for the polyclone model are the following:

1- Each year (for 5 years) we infect individuals, in natura, and we expect that one haplotype from the previous year (t-1) starts the infection and "gives birth" to new haplotypes.

2- In an experimental evolution, we infect individuals with a mix of 5 viral isolates, collect the virus from this first infection, and use it to infect new individuals, and so on, 50 times.

Cheers, JB

dancooke commented 4 years ago

Transmissible cancers are certainly very interesting. I did consider adding a calling model for these a while ago, but the modelling was complex and the use-case fairly limited, so I decided to focus attention elsewhere. Maybe I'll revisit in the future.

I'll have a think regarding the polyclone model. I can see the use-case but it's not immediately obvious to me how to go about modelling it.