iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
109 stars 14 forks

Lazy loading #331

Closed leoisl closed 1 year ago

leoisl commented 1 year ago

This PR mainly adds lazy loading to pandora: a graph is loaded from the index only if reads map to it in map/discover/compare. This considerably reduces RAM usage in very large PanRGs, like the plasmid ones, where we have ~1M graphs. The RAM/runtime improvement compared to the current release, measured on the ESBL sample SRR16977031, is:

- v0.10.0-alpha.0 (baseline, current release): RAM usage 178.1 GB, runtime 130 minutes
- This PR: RAM usage 124.5 GB (30% less RAM than baseline), runtime 31.8 minutes (4 times faster than baseline)
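For readers unfamiliar with the pattern, here is a minimal sketch of the lazy loading idea described above. It is not the code in this PR: `Graph`, `LazyGraphStore`, and the loader callback are hypothetical names used only for illustration, and the real index format and graph classes in pandora differ. The point is only that a graph is read the first time something asks for it and is then kept in memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a graph stored in the index (not pandora's real class).
struct Graph {
    std::uint32_t id;
    std::string payload;
};

// Loads a graph on first request and keeps it cached afterwards.
// Graphs that are never requested are never loaded.
class LazyGraphStore {
public:
    // The loader is an assumption: any callable that reads one graph by id.
    using Loader = std::function<Graph(std::uint32_t)>;

    explicit LazyGraphStore(Loader loader) : loader_(std::move(loader)) {}

    // Returns the graph with this id, calling the loader only on a cache miss.
    const Graph& get(std::uint32_t id) {
        auto it = cache_.find(id);
        if (it == cache_.end()) {
            it = cache_.emplace(id, loader_(id)).first;
        }
        return it->second;
    }

    // Number of graphs currently held in memory.
    std::size_t loaded_count() const { return cache_.size(); }

private:
    Loader loader_;
    std::unordered_map<std::uint32_t, Graph> cache_;
};
```

In this sketch the mapping step would call `get(id)` only for graphs that reads actually hit, so memory grows with the number of graphs hit rather than with the size of the PanRG.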

It also adds a secondary, minor feature, where if a read maps equally well to several graphs, we choose one at random. The current code would systematically choose a single graph instead, thus missing the other mappings.
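One way to picture the random tie-breaking is the sketch below. It is an illustration under assumptions, not the implementation in this PR: `Mapping`, its `score` field (higher assumed better), and `pick_best_with_random_ties` are hypothetical names. It uses single-pass reservoir sampling so that every equally good graph is chosen with the same probability.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical record of how well one read maps to one graph.
struct Mapping {
    std::uint32_t graph_id;
    std::uint32_t score;  // assumption: higher means a better mapping
};

// Returns the index of a best-scoring mapping, choosing uniformly at random
// among ties. `mappings` must be non-empty.
template <typename Rng>
std::size_t pick_best_with_random_ties(const std::vector<Mapping>& mappings, Rng& rng) {
    std::size_t best = 0;
    std::size_t ties = 1;  // number of mappings seen so far with the best score
    for (std::size_t i = 1; i < mappings.size(); ++i) {
        if (mappings[i].score > mappings[best].score) {
            best = i;
            ties = 1;
        } else if (mappings[i].score == mappings[best].score) {
            ++ties;
            // Reservoir sampling: the new tie replaces the current pick
            // with probability 1/ties.
            std::uniform_int_distribution<std::size_t> dist(0, ties - 1);
            if (dist(rng) == 0) {
                best = i;
            }
        }
    }
    return best;
}
```

Always keeping the first best hit, by contrast, would return the same graph every time, which is the systematic behaviour described above.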

Warning: the unit tests are not compiling, so we removed them for now. This increases our tech debt, and we will have to solve this soon. We tested on real data that our changes did not introduce any serious bugs; see https://github.com/rmcolq/pandora/issues/330 for details.

leoisl commented 1 year ago

Pinging @iqbal-lab @Danderson123 - should we wait/keep reviewing or should I merge?

iqbal-lab commented 1 year ago

Sorry for this, but I am super busy helping Michael, Adrian, and Brice with their papers, and finding it hard to get space to review. Also, a review from me is not the same as one from someone completely on top of the codebase. I vote merge.