leoisl closed this 1 year ago
Pinging @iqbal-lab @Danderson123 - should we wait/keep reviewing or should I merge?
Sorry for this, but I am super busy helping Michael, Adrian, and Brice with their papers, and I am finding it hard to make time to review. Also, a review from me is not the same as one from someone completely on top of the codebase. I vote to merge.
This PR mainly adds the lazy loading feature to pandora: a graph is loaded from the index only if reads map to it in map/discover/compare. This reduces RAM usage considerably in very large PanRGs, like the plasmid ones, where we have ~1M graphs. The RAM/runtime improvement compared to the current release, measured on the ESBL sample SRR16977031, is:

- v0.10.0-alpha.0 (baseline, current release): RAM usage: 178.1 GB; Runtime: 130 minutes
- This PR: RAM usage: 124.5 GB (30% less RAM than baseline); Runtime: 31.8 minutes (about 4 times faster than baseline)
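To illustrate the lazy loading idea described above, here is a minimal sketch in Python. It is not pandora's actual implementation: the class `LazyGraphStore`, its `loader` callback, and the use of strings as stand-ins for graphs are all hypothetical names chosen for illustration. The point is only that a graph is read from the index the first time a read maps to it, and never if no read does.

```python
from typing import Callable, Dict


class LazyGraphStore:
    """Hypothetical helper: loads a graph only when it is first requested."""

    def __init__(self, loader: Callable[[int], str]) -> None:
        # `loader` stands in for "read one graph from the on-disk index".
        self._loader = loader
        # Graphs that have been loaded so far, keyed by graph id.
        self._loaded: Dict[int, str] = {}

    def get(self, graph_id: int) -> str:
        # Load on first access, then reuse the cached copy.
        if graph_id not in self._loaded:
            self._loaded[graph_id] = self._loader(graph_id)
        return self._loaded[graph_id]

    def loaded_count(self) -> int:
        # Number of graphs currently held in memory.
        return len(self._loaded)


if __name__ == "__main__":
    store = LazyGraphStore(lambda gid: f"graph-{gid}")
    # Pretend reads mapped only to graphs 3 and 7 (3 is hit twice).
    for gid in (3, 7, 3):
        store.get(gid)
    print(store.loaded_count())
```

With an index of ~1M graphs, memory in this sketch scales with the number of graphs that reads actually hit rather than with the size of the index.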
It also adds a secondary, minor feature, where if a read maps equally well to several graphs, we choose one at random. The current code would systematically choose a single graph instead, thus missing the other mappings.
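The tie-breaking behaviour can be sketched as follows. This is an illustrative Python example, not the code in this PR: the function `pick_best_graph` and the idea of a per-graph integer mapping score are assumptions made for the example.

```python
import random
from typing import Dict, Optional


def pick_best_graph(scores: Dict[str, int],
                    rng: Optional[random.Random] = None) -> str:
    """Return the best-scoring graph, choosing uniformly at random among ties.

    `scores` maps a (hypothetical) graph name to how well the read maps to it.
    Raises ValueError if `scores` is empty.
    """
    rng = rng or random.Random()
    best = max(scores.values())
    # Sort the tied graphs so the result depends only on the RNG state,
    # not on dictionary insertion order.
    tied = sorted(name for name, score in scores.items() if score == best)
    return rng.choice(tied)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for a reproducible run
    scores = {"graphA": 10, "graphB": 10, "graphC": 4}
    picks = {pick_best_graph(scores, rng) for _ in range(100)}
    print(sorted(picks))
```

A deterministic rule such as "take the first maximum" would always return the same graph for a given set of tied scores, which is the systematic behaviour the PR description says the current code has.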
Warning: unit tests are not compiling, so we removed them for now. This increases our tech debt, which we will have to address soon. We tested on real data that our changes did not introduce any serious bugs; see https://github.com/rmcolq/pandora/issues/330 for details.