iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
107 stars 14 forks source link

Improving indexing and mapping performance with low cost #311

Open leoisl opened 1 year ago

leoisl commented 1 year ago

Currently our main indexing structure, which maps kmer hashes to their PRGs and locations, is a std::unordered_map. There are better alternatives, like boost::unordered_flat_map (a review here: https://bannalia.blogspot.com/2022/11/inside-boostunorderedflatmap.html?m=1) and robin hood hashing (unsure about this one, but nice to document, https://github.com/martinus/unordered_dense). There might be even more suited data structures making use of the fact that after built our index is immutable

iqbal-lab commented 1 year ago

(fwiw John Lees and Johanna in the office next to us has been using robin hood hashing and got Dan into it too)

leoisl commented 1 year ago

Might be worth trying Google's SparseHash (https://github.com/sparsehash/sparsehash). Lower memory usage but slower, could have some use cases (e.g. roundhound)

leoisl commented 1 year ago

Andreas has more experience than us on these data structures, and he recommends using this map: https://gitlab.ub.uni-bielefeld.de/gi/sans/-/blob/kc/src/tsl/sparse_map.h (best compromise for both speed and RAM). We might have hash maps that use even less RAM with a cost of being (much) slower, which is sth that we might consider for tools like RH