luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Call in high heterozygotes #265

Closed Axze-rgb closed 1 year ago

Axze-rgb commented 1 year ago

Hello,

Usually, when we think about mapping reads and calling variants, we have the human genome in mind. But I am dealing with an organism that has between 2 and 3% heterozygosity, and even more at the telomeres. This causes an issue: since Illumina reads map with a lot of mismatches, callers tend to ditch a lot of what we think are valid SNPs (because we have limited long-read data supporting that). They get ditched either because the reads map with a relatively poor score due to the many mismatches, or because the SNPs are too close to one another. I see this with GATK, for example. GATK still works for showing the big trends, but I am interested in developing a more refined analysis.

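Is there a way to deal with high heterozygosity in octopus? Basically, there are 2 issues to deal with: reads that map with a relatively poor score because of the many mismatches, and SNPs that are very close to one another.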

I don't understand Octopus well enough to know what parameters to tweak. I will happily try any suggestion.