iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Improve pandora discovery RAM usage #315

Open leoisl opened 1 year ago

leoisl commented 1 year ago

I am quite concerned with this part in denovo racon: https://github.com/rmcolq/pandora/blob/12a08c5483c19fc12411e174970d31c86e842a2d/src/denovo_discovery/discover_main.cpp#L205-L206

This is a dictionary from locus names to the subreads that map to each locus, as inferred by pandora map. This structure could potentially get very large, as we basically store a substring of every read that maps to each locus (it is just the region of the read that maps to that specific locus, but still...). There are probably much better ways to store this info, but I also want to avoid premature optimisation, and will only work on this if RAM is indeed an issue.
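
To make the concern concrete, here is a minimal sketch of the two layouts. It is not the actual pandora code: the type and function names (`LocusToSubreadCopies`, `SubreadInterval`, `materialise_subread`, `copied_bases`) are hypothetical, and the compact variant assumes the original read sequences can still be looked up by index (kept in memory or re-read from disk) when a subread is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the layout described above:
// every locus owns a full copy of each subread that maps to it.
using LocusToSubreadCopies = std::map<std::string, std::vector<std::string>>;

// Hypothetical compact alternative: store only where the subread lies
// in its read, not the bases themselves.
struct SubreadInterval {
    std::size_t read_id; // index of the read in some read store (assumption)
    std::size_t start;   // 0-based offset of the mapped region in the read
    std::size_t length;  // number of bases in the mapped region
};
using LocusToSubreadIntervals = std::map<std::string, std::vector<SubreadInterval>>;

// Rebuild the subread on demand from the read store.
std::string materialise_subread(const std::vector<std::string>& reads,
                                const SubreadInterval& interval)
{
    assert(interval.read_id < reads.size());
    assert(interval.start + interval.length <= reads[interval.read_id].size());
    return reads[interval.read_id].substr(interval.start, interval.length);
}

// Total number of bases duplicated by the copy-based layout.
std::size_t copied_bases(const LocusToSubreadCopies& by_locus)
{
    std::size_t total = 0;
    for (const auto& entry : by_locus) {
        for (const auto& subread : entry.second) {
            total += subread.size();
        }
    }
    return total;
}
```

With the interval layout, each mapped region costs a fixed `sizeof(SubreadInterval)` bytes regardless of its length, whereas the copy-based layout grows with the number of mapped bases. The trade-off is that subreads must be rebuilt when needed, which only pays off if RAM turns out to be the bottleneck.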

Originally posted by @leoisl in https://github.com/rmcolq/pandora/issues/303#issuecomment-1297228115

iqbal-lab commented 1 year ago

Thumbs up