luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Incorporating octopus haplotype identification into existing microbial genomics program #156

Closed MrOlm closed 3 years ago

MrOlm commented 3 years ago

Hello,

Congratulations on the recent publication, and thank you for making this tool so well-documented and applicable to a wide range of use cases. I wrote and maintain a tool for microbial population genomics (https://www.nature.com/articles/s41587-020-00797-0 / https://github.com/MrOlm/instrain), and I'm interested in using octopus to add support for haplotype calling. I have two questions along these lines:

1) Is this something you're comfortable with? InStrain has an open-source MIT license as well, and I would of course acknowledge your program as the source of haplotype calls and ask users to cite it if they use the haplotype information.

2) If you are comfortable with (1), do you have any recommendations on the best way to call octopus? It would always be in polyclone mode. InStrain already calls SNPs from a .bam file, so I'd like to avoid making octopus re-process the .bam file to call SNPs again. In your view, would the best way to do this be to have users install octopus as a dependency, have inStrain write an intermediate VCF file to disk, and just do a system call out to octopus using the --source-candidates option on the intermediate VCF file?
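To make the intermediate-file step in question (2) concrete, here is a minimal sketch of writing already-called SNVs as a sites-only VCF. The tuple layout, the `format_vcf` name, and the contig-length dictionary are assumptions for illustration and are not part of inStrain or Octopus. Whether Octopus needs the file to be compressed and indexed is not covered in this thread, so that step is left out.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (contig, 1-based position, ref base, alt base).
Snv = Tuple[str, int, str, str]


def format_vcf(snvs: Iterable[Snv], contig_lengths: Dict[str, int]) -> str:
    """Render SNVs as a minimal sites-only VCF (no sample columns)."""
    lines = ["##fileformat=VCFv4.2"]
    for name, length in contig_lengths.items():
        lines.append(f"##contig=<ID={name},length={length}>")
    lines.append("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO")

    # Sort records by contig (in header order), then by position.
    order = {name: i for i, name in enumerate(contig_lengths)}
    for contig, pos, ref, alt in sorted(snvs, key=lambda s: (order[s[0]], s[1])):
        # ID, QUAL, FILTER and INFO are left as missing values (".").
        lines.append(f"{contig}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.")
    return "\n".join(lines) + "\n"


def write_vcf(path: str, snvs: Iterable[Snv], contig_lengths: Dict[str, int]) -> None:
    """Write the rendered VCF text to disk."""
    with open(path, "w") as handle:
        handle.write(format_vcf(snvs, contig_lengths))
```

A multiallelic site can be written as several records at the same position, one per alternate allele.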

Thanks in advance, Matt

dancooke commented 3 years ago

Hi Matt,

I'm very happy for you to integrate Octopus into your tool (which looks very interesting!).

My feeling on the best option would be for your tool to somehow install Octopus. That way you have better control over versions etc. Perhaps take a look at bcbio for ideas on ways to do this.

In terms of using Octopus, you can supply your own SNV calls with --source-candidates, although this option by itself won't stop Octopus generating its own candidates (the variants you provide will just be added to the candidate set); you'll also need --disable-denovo-variant-discovery if you want Octopus to only consider your SNVs. I would be cautious about doing this, however, as you may get suboptimal results given that you're only supplying SNVs. It's important to bear in mind that Octopus haplotype calls fall out of the genotyping model, and that model can be thrown off if it's not given the true haplotypes as candidates.
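As a rough illustration of the two options above, a wrapper could assemble the Octopus command line along these lines. This is a sketch only: the function names, the `only_source_candidates` switch, and the file paths are hypothetical, and it assumes an `octopus` executable is on the PATH. By default it leaves de novo discovery enabled, so the supplied SNVs are simply added to Octopus's own candidates.

```python
import subprocess
from typing import List


def build_octopus_command(
    reference: str,
    bam: str,
    candidates_vcf: str,
    output_vcf: str,
    only_source_candidates: bool = False,
    octopus_bin: str = "octopus",  # assumed to be installed and on PATH
) -> List[str]:
    """Assemble an Octopus polyclone call as an argument list."""
    cmd = [
        octopus_bin,
        "--reference", reference,
        "--reads", bam,
        "--caller", "polyclone",
        "--source-candidates", candidates_vcf,
        "--output", output_vcf,
    ]
    if only_source_candidates:
        # Restrict Octopus to the supplied variants only. As noted above,
        # this risks suboptimal haplotypes when only SNVs are supplied.
        cmd.append("--disable-denovo-variant-discovery")
    return cmd


def run_octopus(cmd: List[str]) -> None:
    """Run Octopus and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file paths.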

Cheers, Dan

MrOlm commented 3 years ago

Great, thanks for the quick reply and encouragement! Just one follow-up question regarding --source-candidates and --disable-denovo-variant-discovery: if I provide all SNVs (multiallelic sites) and all fixed mutations (where all reads disagree with the reference sequence) in the provided VCF file, would this overcome the problem you're talking about?

Thanks again, Matt

dancooke commented 3 years ago

"if I provide all SNVs (multiallelic sites) and all fixed mutations (where all reads disagree with the reference sequence) in the provided VCF file, would this overcome the problem you're talking about?"

No, I was referring to the fact that you wouldn't be providing indels or complex substitutions.

MrOlm commented 3 years ago

Got it, that makes sense, and thanks for all the info. I'll post back here if I hit any hiccups.