iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
110 stars 14 forks source link

Index optimisation #342

Closed leoisl closed 1 year ago

leoisl commented 1 year ago

The main feature in this PR is removing the prg::Path path attribute from the MiniRecord struct. The prg::Path represents the path of nodes the minimiser stems from in the linear string representation of the PRG. It is used to sort minimiser hits when a minimiser from a read multimaps to several places in a graph. However, we don't take into account the places the minimiser was hit in the graph when defining and filtering clusters of hits. Therefore, the usefulness of this attribute is limited. Furthermore, a good approximation to this attribute is to instead use the kmer_node_id in the Kmer Graph of the PRG, as prg::Paths that appear earlier in the PRG string will likely have lower ids as the ids correspond to the topological sort order.

Thus, I don't expect much difference in output/results when replacing prg::Path by kmer_node_id in MiniRecord. This was also the observation when running pandora on the 4-way data (see https://github.com/rmcolq/pandora/issues/330#issuecomment-1590903732). Illumina results were actually slightly better with this change, while nanopore results remained the same. The motivation for this change is that it allows us to significantly reduce RAM usage and also improves runtime (e.g.: improves RAM usage by 60% and runtime by 35% in the roundhound use case for a sample).

This is the final optimisation PR regarding roundhound use cases. Compared with the current pre-release, this is the benchmark: pandora compare with the RH plasmid DB (~1M PRGs) and the ESBL sample SRR16977031:

This PR also features some very few minor stuff, e.g.:

There will still be a final PR after this one doing several very minor fixes, and then finally a last one to prepare the pre-release.