sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

running gather on many (very) closely related genomes is (very) slow #1771

Closed ctb closed 2 years ago

ctb commented 2 years ago

(this is going to be anecdata for now, because I ran it a month or two back and forgot the details)

in brief, I wanted to play around with data from "Generation of lineage-resolved complete metagenome-assembled genomes by precision phasing", which has many lineage-resolved genomes from a metagenome.

in particular, I wanted to see how well sourmash gather did in terms of searching a lineage-resolved bin against the Illumina metagenome.

The below files are on farm in ~ctbrown/2021-bickhart.

So, to the best of my recollection, I did the following -

and that basically never finished.

so this is a marker to explore what's going on :). I would naively have expected the prefetch stage to go fast, at least?

ctb commented 2 years ago

search key words so I can find this next time: slow, strain, long reads 😆

ctb commented 2 years ago

Starting to think this is mostly about gather being very slow when there are many very similar overlaps. Probably something that can be magicked away with a slightly different data structure.

ctb commented 2 years ago

I've been digging into this a little bit because of #2095, and wrote a little script, make-many.py (included below), that generates a bunch of closely related genome signatures differing from each other by only a few hashes.

I need to do some actual profiling before I commit to this, but it looks to me like a likely culprit is the consume method in the CounterGather code, which on each call iterates through all stored matches in order to subtract intersections. When there are many closely related matches, this ends up being effectively O(N^2) over the first few matches, at least.
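To make the shape of the problem concrete, here's a toy sketch of that consume-and-subtract pattern, using plain Python sets rather than the actual CounterGather/MinHash classes - the names and details here are illustrative only:

# toy sketch, not sourmash's actual CounterGather; "matches" are just sets of hashes
class ToyCounterGather:
    def __init__(self, query_hashes):
        self.query = set(query_hashes)
        self.matches = {}                # name -> hashes still overlapping the query

    def add(self, name, hashes):
        self.matches[name] = set(hashes) & self.query

    def peek(self):
        # return the match with the largest remaining overlap, if any
        if not self.matches:
            return None
        return max(self.matches.items(), key=lambda kv: len(kv[1]))

    def consume(self, intersect_hashes):
        # subtract the hashes just "explained" by the best match from every
        # remaining match; with N stored matches that mostly share the same
        # hashes, each consume call does N set subtractions, so selecting even
        # the first few matches costs roughly O(N^2) subtractions overall.
        intersect = set(intersect_hashes)
        for name in list(self.matches):
            self.matches[name] -= intersect
            if not self.matches[name]:
                del self.matches[name]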

One thought is that we could do some kind of Jaccard clustering of the matches, perhaps using kSpider-style tactics; that might let us handle whole clusters of near-identical matches more cleverly instead of one match at a time.

Another thought is to do more rapid containment analysis, e.g. invest in Bloom filter-augmented data structures or some such. This basically goes the route of SBTs and/or BIGSI, ref https://github.com/sourmash-bio/sourmash/issues/545 again.

Last but not least, we could invest in more sophisticated data structures - AllSome SBTs or Split Sequence Bloom Trees would seem to be a better choice. Ref https://github.com/sourmash-bio/sourmash/issues/545.

I tried implementing a new CounterGather class on top of LCA_Database (aka a reverse index, which @luizirber has also been playing with - mine was in Python tho, not Rust ;). It was straightforward but quite a bit slower. I might try with the sqlite LCA database before giving up on it though.
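For reference, the reverse-index idea boils down to a hash -> signatures mapping. A toy version (my own illustration, not the actual LCA_Database API) looks something like this:

from collections import defaultdict

# toy reverse index: hash value -> names of the signatures containing it
class ToyRevIndex:
    def __init__(self):
        self.hash_to_sigs = defaultdict(set)

    def insert(self, name, hashes):
        for h in hashes:
            self.hash_to_sigs[h].add(name)

    def count_overlaps(self, query_hashes):
        # count, per signature, how many of the query's hashes it contains;
        # only hashes actually present in the index get touched.
        counts = defaultdict(int)
        for h in query_hashes:
            for name in self.hash_to_sigs.get(h, ()):
                counts[name] += 1
        return counts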

the script itself:

#! /usr/bin/env python
"""Generate many closely related signatures, plus one query that contains them all."""
import sys
import random

import sourmash
from sourmash import sourmash_args

N_OUTPUT = 1000       # number of closely related signatures to generate
N_TO_ADD = 10         # extra hashes added to each generated signature
POOLSIZE = 1000       # size of the shared pool of extra hash values

# load the base signature (k=31) that all generated signatures share
ss = sourmash.load_one_signature(sys.argv[1], ksize=31)
mh = ss.minhash

# shared pool of extra hash values to sprinkle into each copy
pool = list(range(POOLSIZE))

with sourmash_args.SaveSignaturesToLocation('all-sig.zip') as save_sig:
    for i in range(N_OUTPUT):
        mh_i = mh.to_mutable()

        # add N_TO_ADD randomly chosen hashes from the pool
        random.shuffle(pool)
        mh_i.add_many(pool[:N_TO_ADD])

        ss_i = sourmash.SourmashSignature(mh_i, name=f"sig{i}")
        save_sig.add(ss_i)

# the query is the base sketch plus the entire pool, so it overlaps every
# generated signature
mh_all = mh.to_mutable()
with sourmash_args.SaveSignaturesToLocation('query.sig.gz') as save_sig:
    mh_all.add_many(pool)
    ss_all = sourmash.SourmashSignature(mh_all, name="q")
    save_sig.add(ss_all)
ctb commented 2 years ago

(run the above script on some poor genome signature and then:

sourmash gather query.sig.gz all-sig.zip --threshold=0

to benchmark)

ctb commented 2 years ago

downsampled the query and bins to scaled=10k and took a look with py-spy for k=31. Looks like 80% of the time is spent in this code from search.py, in `GatherDatabases.__init__(...)`:

query_mh.remove_many(noident_mh)

putting in print statements, I saw:

XXX removing: 1264278 from 1379822
XXX time: 147.1

so removing ~1.3m hashes from a ~1.4m-hash MinHash is what's taking all the time.

py-spy flamegraph below - the big chunk in the middle left is the remove_many call.

[py-spy flamegraph screenshot, 2022-07-17]
mr-eyes commented 2 years ago

I am wondering what could be done to improve this:

https://github.com/sourmash-bio/sourmash/blob/401ba4873f2d81a86c3046d2b61613005dc8423a/src/core/src/sketch/minhash.rs#L418-L423

Is the function being called just once to remove the 1264278 hashes or multiple times?

ctb commented 2 years ago

the function is being called just once.

and whoops, I forgot to link in https://github.com/sourmash-bio/sourmash/issues/1617, where the remove_many problem is analyzed in gory detail and some possible solutions are discussed!
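(As a general illustration of how this kind of bulk removal can go quadratic - plain Python here, not the actual Rust code: deleting elements one at a time from a sorted array pays an O(n) shift per deletion, while a single filtering pass is linear.)

import bisect

def remove_many_elementwise(sorted_hashes, to_remove):
    # delete each hash individually: a binary search plus an O(n) shift of the
    # tail per deletion, so removing most of an n-element array is ~O(n^2)
    out = list(sorted_hashes)
    for h in to_remove:
        i = bisect.bisect_left(out, h)
        if i < len(out) and out[i] == h:
            del out[i]
    return out

def remove_many_onepass(sorted_hashes, to_remove):
    # one pass with a set lookup: O(n + m), order preserved
    removeset = set(to_remove)
    return [h for h in sorted_hashes if h not in removeset]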

ctb commented 2 years ago

This can be sped up quite nicely by using union-of-intersections-with-query instead of subtracting unidentified hashes: https://github.com/sourmash-bio/sourmash/pull/2123
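The gist of the change, as a toy sketch with Python sets standing in for MinHash sketches (a paraphrase of the idea, not the code in the PR):

def remaining_by_subtraction(query, noident):
    # previous approach: strip every unidentified hash out of the (large) query
    return query - noident

def remaining_by_union_of_intersections(query, matches):
    # alternative: build the identified portion of the query as the union of
    # its intersections with each match - only the found hashes get touched
    found = set()
    for m in matches:
        found |= (query & m)
    return found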

ctb commented 2 years ago

I'm going to close this in favor of https://github.com/sourmash-bio/sourmash/issues/1617 and https://github.com/sourmash-bio/sourmash/issues/2161 which nicely cover the remaining parts of this issue :)