The main feature in this PR is removing the prg::Path path attribute from the MiniRecord struct. The prg::Path represents the path of nodes the minimiser stems from in the linear string representation of the PRG. It is used to sort minimiser hits when a minimiser from a read multimaps to several places in a graph. However, we don't take into account the places the minimiser was hit in the graph when defining and filtering clusters of hits. Therefore, the usefulness of this attribute is limited. Furthermore, a good approximation to this attribute is to instead use the kmer_node_id in the Kmer Graph of the PRG, as prg::Paths that appear earlier in the PRG string will likely have lower ids as the ids correspond to the topological sort order.
Thus, I don't expect much difference in output/results when replacing prg::Path by kmer_node_id in MiniRecord. This was also the observation when running pandora on the 4-way data (see https://github.com/rmcolq/pandora/issues/330#issuecomment-1590903732). Illumina results were actually slightly better with this change, while nanopore results remained the same. The motivation for this change is that it allows us to significantly reduce RAM usage and also improves runtime (e.g.: improves RAM usage by 60% and runtime by 35% in the roundhound use case for a sample).
This is the final optimisation PR regarding roundhound use cases. Compared with the current pre-release, this is the benchmark:
pandora compare with the RH plasmid DB (~1M PRGs) and the ESBL sample SRR16977031:
All the recent PRs (lazy loading + read info optimisation + index optimisation, unreleased):
RAM usage: 9.1 GB (95% less RAM than baseline)
Runtime: 8.35 minutes (15.5 times faster than baseline)
This PR also features some very few minor stuff, e.g.:
passing some shared pointers by reference instead of copy;
not changing error rate if user gave bad params (e.g. --illumina and high error rate);
There will still be a final PR after this one doing several very minor fixes, and then finally a last one to prepare the pre-release.
The main feature in this PR is removing the
prg::Path path
attribute from theMiniRecord
struct. Theprg::Path
represents the path of nodes the minimiser stems from in the linear string representation of the PRG. It is used to sort minimiser hits when a minimiser from a read multimaps to several places in a graph. However, we don't take into account the places the minimiser was hit in the graph when defining and filtering clusters of hits. Therefore, the usefulness of this attribute is limited. Furthermore, a good approximation to this attribute is to instead use thekmer_node_id
in the Kmer Graph of the PRG, asprg::Path
s that appear earlier in the PRG string will likely have lower ids as the ids correspond to the topological sort order.Thus, I don't expect much difference in output/results when replacing
prg::Path
bykmer_node_id
inMiniRecord
. This was also the observation when runningpandora
on the 4-way data (see https://github.com/rmcolq/pandora/issues/330#issuecomment-1590903732). Illumina results were actually slightly better with this change, while nanopore results remained the same. The motivation for this change is that it allows us to significantly reduce RAM usage and also improves runtime (e.g.: improves RAM usage by 60% and runtime by 35% in the roundhound use case for a sample).This is the final optimisation PR regarding roundhound use cases. Compared with the current pre-release, this is the benchmark: pandora compare with the RH plasmid DB (~1M PRGs) and the ESBL sample SRR16977031:
v0.10.0-alpha.0
(baseline, current release) RAM usage: 178.1 GB Runtime: 130 minutesAll the recent PRs (lazy loading + read info optimisation + index optimisation, unreleased): RAM usage: 9.1 GB (95% less RAM than baseline) Runtime: 8.35 minutes (15.5 times faster than baseline)
This PR also features some very few minor stuff, e.g.:
--illumina
and high error rate);There will still be a final PR after this one doing several very minor fixes, and then finally a last one to prepare the pre-release.