luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Getting AD information for specific sites #146

Closed: bredelings closed this issue 3 years ago

bredelings commented 3 years ago

Hi, I'm trying to get the read depth for reference and alternative alleles at variant sites for a bunch of individuals.

The approach that people use with GATK is to jointly call the individuals and look at the AD field.

I've been using -C polyclone --max-clones 2 for the individual samples. I guess there isn't a polyclone population model? So I tried precomputing a list of variant sites to check, but then octopus might not emit some of these sites.

Is there a way to make octopus emit the relevant sites? Ideas that occurred to me:

Alternatively, I could try to call 10 or so individuals under the population model with -P 1 or -P 2 and use the variant sites emitted in that file, since I presume that every individual would have an AD field at any site that is variant in at least one individual.

Would you have a suggestion?

Thanks for your help! -BenRI

Version

$ octopus --version
octopus version 0.7.0 (develop b83ce113)
Target: x86_64 Linux 2.6.32-642.11.1.el6.x86_64
SIMD extension: SSE2
Compiler: GNU 10.1.0
Boost: 1_74

bredelings commented 3 years ago

Using the population model works. A population+polyclone model would be interesting.
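For anyone reading along, here is a minimal sketch of how the per-sample AD values could be pulled out of one data line of a joint (multi-sample) VCF. It assumes the record carries an AD entry in its FORMAT column; the function name `allele_depths` and the example record are made up for illustration and are not part of octopus.

```python
def allele_depths(vcf_line, sample_names):
    """Return {sample: [ref_depth, alt_depth, ...]} for one VCF data line.

    Hypothetical helper: assumes tab-separated VCF with FORMAT in column 9
    and one column per sample after that.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if "AD" not in format_keys:
        return {}  # this record has no AD annotation at all
    ad_index = format_keys.index("AD")

    depths = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        values = sample_field.split(":")
        # Trailing FORMAT values may be omitted, and "." means missing.
        if ad_index >= len(values) or values[ad_index] == ".":
            depths[name] = None
            continue
        depths[name] = [
            None if v == "." else int(v) for v in values[ad_index].split(",")
        ]
    return depths


# Made-up record with two samples, S1 and S2.
record = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0|1:12,9:21\t0|0:20,0:20"
print(allele_depths(record, ["S1", "S2"]))
# prints {'S1': [12, 9], 'S2': [20, 0]}
```

The sample names would normally come from the `#CHROM` header line of the same file; they are passed in explicitly here to keep the sketch short.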

dancooke commented 3 years ago

Sorry for the delay in responding to this; I'm busy writing my thesis at the moment! Unfortunately there is no polyclone-population model yet, but the prospect has been raised a few times on the issue pages, so I may give it more thought at some point. Using the population model is one approach, but you could also just call each sample independently with the polyclone model and merge the results.

bredelings commented 3 years ago

Thanks! Of course thesis writing has to be the first priority. I'm glad to hear that a polyclone-population model is at least being thought about.

I would like to call each sample with polyclone and then merge... but if I run the polyclone model for a single individual, it won't write out the AD counts for sites that don't vary in that individual. Perhaps I could make it output EVERY site, and then merge the samples, and THEN remove sites that are identical in all individuals? I'm not quite sure how to do that though.
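To make the last step concrete, here is a rough sketch of the "remove sites that are identical in all individuals" filter. It assumes the per-sample call sets have already been merged into a single multi-sample VCF by some external tool, and it does not address how to get octopus to emit every site in the first place. The function names are hypothetical. Because samples may be called with different numbers of clones, the sketch compares the set of alleles in each genotype rather than the full genotype string, so `0|0` and `0` count as identical; that is a design choice for illustration, not something octopus prescribes.

```python
import re


def genotype_alleles(sample_field, gt_index):
    """Return the set of allele codes in a sample's GT value, e.g. {'0', '1'}."""
    gt = sample_field.split(":")[gt_index]
    return frozenset(re.split(r"[|/]", gt))


def is_variable_site(vcf_line):
    """True if at least two samples carry different allele sets at this site."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return False
    gt_index = format_keys.index("GT")
    allele_sets = {genotype_alleles(s, gt_index) for s in fields[9:]}
    return len(allele_sets) > 1


def drop_invariant_sites(lines):
    """Yield header lines unchanged and only those records that vary across samples."""
    for line in lines:
        if line.startswith("#") or is_variable_site(line):
            yield line
```

Note that a missing genotype (`.`) is treated as just another value here, so a site called in one sample and missing in another is kept. Typical usage would be to pass an open file handle for the merged VCF to `drop_invariant_sites` and write the yielded lines to a new file.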