iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Pandora heavily undermapping #325

Closed leoisl closed 9 months ago

leoisl commented 1 year ago

This is happening in one of our roundhound runs, but more generally we know that pandora undermaps reads to the PRGs. In some cases the undermapping is heavy enough to affect genotyping. This is a breaking issue for roundhound; in this issue we will discuss it and try to solve it.

For one of our plasmid genes, minimap2 maps 1605 reads while pandora maps only 718. Every read that pandora maps is also mapped by minimap2. I will therefore extract the reads that pandora leaves unmapped and debug why.
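A minimal sketch of how the reads that minimap2 maps but pandora does not could be extracted for one gene. It assumes both tools' mappings are available as SAM-format text; whether pandora's output is in that form in a given run is an assumption here, and the function names `mapped_read_ids` and `compare_mappers` are hypothetical helpers, not part of either tool. It relies only on the standard SAM columns (QNAME, FLAG, RNAME) and on flag bit 0x4 marking an unmapped record.

```python
def mapped_read_ids(sam_lines, target):
    """Return the names of reads with a mapped SAM record on `target`."""
    ids = set()
    for line in sam_lines:
        # Skip blank lines and SAM header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        # Bit 0x4 of FLAG means the read is unmapped.
        if flag & 0x4:
            continue
        if rname == target:
            ids.add(qname)
    return ids


def compare_mappers(minimap2_sam, pandora_sam, target):
    """Split the reads mapped to `target` into shared and tool-specific sets."""
    mm2 = mapped_read_ids(minimap2_sam, target)
    pan = mapped_read_ids(pandora_sam, target)
    return {
        "minimap2_only": mm2 - pan,  # candidates to debug
        "pandora_only": pan - mm2,   # empty if minimap2 maps a superset
        "shared": mm2 & pan,
    }
```

The `minimap2_only` set is the list of reads to pull out of the original read file for debugging; an empty `pandora_only` set would match the observation that every read pandora maps is also mapped by minimap2.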

This relates to https://github.com/rmcolq/pandora/issues/316. Although undermapping is more pronounced at the edges of genes, we find it throughout the whole gene.

leoisl commented 1 year ago

General stats

Gene: cpe003_contig_3_00225

Conclusion

This data shows that pandora can map almost as many reads to cpe003_contig_3_00225 as minimap2 does: minimap2 maps 1605 reads and pandora can map 1539, a loss of about 4% of reads, which is totally acceptable IMO and likely due to the lower density of minimising kmers at the edges. So there is no significant problem with the pandora mapping algorithm or its parameters. The main issue is that pandora does not handle multimapping: 822 of the 1539 reads that pandora can map to cpe003_contig_3_00225 end up mapped to 3 other genes because they map slightly better there. The solution is to implement multimapping in pandora.
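To make the arithmetic and the effect of best-hit assignment concrete, here is a small sketch. The read counts are the ones quoted in this comment; the toy `hits` structure, the scores, and the function names are hypothetical and only illustrate the idea. This is not pandora's actual assignment logic.

```python
from collections import Counter

# Figures quoted in the comment above.
minimap2_reads = 1605
pandora_mappable = 1539
reassigned_elsewhere = 822

# Fraction of minimap2's reads that pandora cannot map at all (about 4%).
loss_vs_minimap2 = (minimap2_reads - pandora_mappable) / minimap2_reads
# Reads left on the gene once best-hit assignment moves 822 elsewhere;
# this is close to the 718 reported at the top of the thread.
kept_by_best_hit = pandora_mappable - reassigned_elsewhere


def best_hit_counts(hits):
    """Count reads per gene when each read goes only to its best-scoring gene."""
    counts = Counter()
    for scores in hits.values():
        if scores:
            counts[max(scores, key=scores.get)] += 1
    return counts


def multimap_counts(hits):
    """Count reads per gene when a read counts for every gene it maps to."""
    counts = Counter()
    for scores in hits.values():
        counts.update(scores.keys())
    return counts


# Toy data: read -> {gene: mapping score}; all values are made up.
hits = {
    "r1": {"geneA": 10, "geneB": 12},  # maps slightly better to geneB
    "r2": {"geneA": 15},
    "r3": {"geneA": 9, "geneC": 11},   # maps slightly better to geneC
}
```

In the toy data all three reads can map to `geneA`, but best-hit assignment credits it with only one, which is the same pattern as the gene losing 822 of its 1539 mappable reads to other genes.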

iqbal-lab commented 9 months ago

Is this still open?

leoisl commented 9 months ago

No, undermapping is not an issue IMO: pandora maps at a similar rate to minimap2, but we may see apparent undermapping as a side effect of pandora not multimapping reads to several different genes. Release 0.12.0-alpha.0 improves this slightly, but a fair comparison with tools like minimap2 is only possible if we implement multimapping in pandora. Closing this, as the real issue is that we need to implement multimapping, though I am not even sure this feature is required...