luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Option to call a set of known variants in cancer model in addition to classic calling? #155

Closed alxsimon closed 3 years ago

alxsimon commented 3 years ago

Hi, this is a usage question about the cancer model calling.

I have just started looking at octopus and wondered if it could apply to my use case: tumour-only calling in a non-model organism. I need to call both somatic variants and a set of known germline variants from the population. I could not find in the documentation whether there is a way to call both at the same time.

This would maybe be the equivalent in Freebayes of:

-@ --variant-input VCF                                                        
                Use variants reported in VCF file as input to the algorithm.  
                Variants in this file will included in the output even if     
                there is not enough support in the data to pass input filters.

Or in GATK Mutect2 a combination of --panel-of-normals and --genotype-pon-sites.

I don't think the planned option --regenotype would fit this need, as this would not look for unknown somatic or germline variants.

Thanks

dancooke commented 3 years ago

Hi, Octopus will automatically call both germline and somatic variants in cancer mode. Have you seen the wiki documentation, in particular the section on VCF output?

For tumour-only calling there will be a high degree of uncertainty in germline/somatic classification for some variants, so make sure to look at the PP INFO field.
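As a rough illustration of checking the PP INFO field, here is a minimal Python sketch that reads the INFO column of VCF records and flags low-confidence calls. It assumes that somatic calls carry a `SOMATIC` INFO flag and that `PP` is a Phred-scaled posterior; both should be checked against the header of your own VCF. The two records and the threshold of 20 are made up for the example.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-type keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def classify(vcf_line, min_pp=20.0):
    """Return (chrom, pos, label, pp, confident) for one VCF data line.

    Assumption: a SOMATIC flag marks somatic calls and PP is Phred-scaled.
    min_pp is an arbitrary illustrative threshold, not a recommended value.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])
    pp = float(info["PP"]) if "PP" in info else None
    label = "somatic" if info.get("SOMATIC") else "germline"
    confident = pp is not None and pp >= min_pp
    return cols[0], int(cols[1]), label, pp, confident


# Hypothetical records, for illustration only.
records = [
    "chr1\t100\t.\tA\tG\t50\tPASS\tPP=35.2;SOMATIC",
    "chr1\t200\t.\tC\tT\t50\tPASS\tPP=4.1",
]
results = [classify(r) for r in records]
for row in results:
    print(row)
```

Calls that fall below the chosen threshold are the ones where the germline/somatic label deserves the most scepticism in tumour-only data.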

In terms of supplying known variants, you can do this with the --source-candidates option, but this doesn't guarantee the sites will appear in the output and doesn't act like a panel of normals either.
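For concreteness, a tumour-only invocation with known variants supplied as candidates might be assembled as in the sketch below. Only `--source-candidates` comes from this thread; the other flags (`-R`, `-I`, `-C cancer`, `-o`) are my recollection of the Octopus command line and should be verified with `octopus --help`. All file names are hypothetical.

```python
import shlex


def octopus_cancer_command(reference, reads, known_vcf, output):
    """Build an argument list for a tumour-only run in cancer mode.

    No normal sample is named, so every sample is treated as tumour.
    Flags other than --source-candidates are assumptions to verify.
    """
    return [
        "octopus",
        "-R", reference,                   # reference FASTA
        "-I", reads,                       # tumour BAM/CRAM
        "-C", "cancer",                    # cancer calling model
        "--source-candidates", known_vcf,  # known variants as extra candidates
        "-o", output,
    ]


# Hypothetical file names, for illustration only.
cmd = octopus_cancer_command(
    "ref.fa", "tumour.bam", "known_sites.vcf.gz", "calls.vcf.gz"
)
print(shlex.join(cmd))
```

As noted above, supplying candidates this way only adds them to the set the caller considers; it does not force those sites into the output.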

alxsimon commented 3 years ago

Thanks for the links, I see what you mean.

I think my question was unclear on one detail; correct me if I am wrong. Consider a position that is not variable in the sample (all reads carry the same allele): I assume it would not appear in the output. I may need to output such informative sites (for example, ones that I know are fixed between two species or two populations) for later analyses.

To rephrase: I would like to genotype not only variants segregating in the sample, but also variants segregating in the population (which are known in advance).

Honestly, this is just a detail and an edge case. I think I could retrieve this information with another tool or maybe with the individual calling model; doing it in one run would just avoid two passes over the whole BAM.

dancooke commented 3 years ago

If I understand correctly, you want to be able to call homozygous reference genotypes at particular sites? Unfortunately, the cancer calling model doesn't support this yet. There's limited support for reference calling (i.e., so-called "gVCFs") in the individual calling model (see the --refcall option), but it's untested and can result in poor runtime/memory performance in some cases. I'm hoping to work more on this at some point.

alxsimon commented 3 years ago

Yes, both homozygous ref and alt alleles at a set of given sites. That's fine anyway, as I said this is an edge case and I will find a workaround. I just wanted to be sure this was not possible before going another way for this. Thanks for the quick answers!
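One possible shape for such a workaround, sketched in Python: compare the known sites against the calls from the first pass and list the ones that are absent, so that only those positions need a second look with another tool. This is an assumption about how the workaround could go, not something described in the thread. It matches sites on exact (chrom, pos, ref, alt), so differently normalised representations of the same variant would not match, and absence from the output does not by itself prove a homozygous reference genotype (the site may simply lack coverage).

```python
def site_keys(vcf_text):
    """Collect (chrom, pos, ref, alt) keys from VCF text, one per ALT allele."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def missing_known_sites(known_vcf_text, called_vcf_text):
    """Known sites with no exact match among the called variants."""
    return sorted(site_keys(known_vcf_text) - site_keys(called_vcf_text))


# Hypothetical records, for illustration only.
known = "#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr1\t300\t.\tT\tC,G\n"
called = "#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr1\t300\t.\tT\tC\n"
todo = missing_known_sites(known, called)
print(todo)
```

The resulting list could be written out as a regions or targets file for whichever genotyper handles the second, much smaller pass.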