sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Search a database with multiple genomes #2069

Open jsgounot opened 2 years ago

jsgounot commented 2 years ago

Hi,

I would like to compare multiple genomes (stored in a single signature file) against the GTDB database with a minimum similarity threshold.

Did I miss an obvious way to do it?

Regards, JS

ctb commented 2 years ago

hi @jsgounot this isn't currently possible with the command line, I'm afraid.

If you were using a shell script or a workflow system, you could loop over the md5sums of the query signatures in a file query.zip, running sourmash search --md5 <query md5> query.zip database.zip for each.

Alternatively, you could use sourmash sig split to break the query signatures into many individual files.
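A minimal shell sketch of both workarounds. The md5 values and file names below are placeholders, and `echo` keeps this a dry run (drop the `echo` to actually execute the searches):

```shell
set -eu
# Workaround 1: search each query record by md5sum. In a real run the md5s
# would come from the query collection (e.g. via `sourmash sig describe`);
# the values below are placeholders.
for md5 in aaa111 bbb222; do
    echo sourmash search --md5 "$md5" query.zip database.zip
done

# Workaround 2: split the collection into per-record files first
# (`sourmash sig split query.zip`), then search each file. The file names
# here are placeholders for whatever sig split produces:
for sig in query_a.sig query_b.sig; do
    echo sourmash search "$sig" database.zip
done
```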

jsgounot commented 2 years ago

hi @ctb, thanks for the quick answer. Noted. Any chance of seeing this in the future? To be honest, querying a list of bins against various databases was 95% of my usage of mash, and implementing this would greatly improve my usage of sourmash.

ctb commented 2 years ago

thinking out loud -

We do also support multiple queries in the sourmash lca subcommands with sourmash lca classify --query <input.sig> and that seems like a straightforward approach here (and it's how we implemented multigather, too).

I also wonder if we could better support the kind of only mildly complicated one-liner with parallel used in https://github.com/sourmash-bio/sourmash/issues/2066? That would resolve the speed/parallelization concerns above.
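That one-liner pattern can also be approximated with the more widely available `xargs -P`; a dry-run sketch (file names assumed, `echo` standing in for actually invoking sourmash):

```shell
# Feed per-query signature files to up to 4 parallel workers.
# With GNU parallel the equivalent would be roughly:
#   ls split/*.sig | parallel -j 4 sourmash search {} database.zip
printf '%s\n' q1.sig q2.sig q3.sig \
    | xargs -P 4 -I{} echo sourmash search {} database.zip
```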

Also, I'll look at the mash approach and see if we can learn from their style ;)

jsgounot commented 2 years ago

I don't know if my point of view matters in this context but I don't think snakemake should be seen as an option here. As much as I love workflow managers, it's way too heavy for something like this. parallel makes more sense, but there are still some aspects I see in favor of having a direct implementation within sourmash:

  • parallel might not be available on all systems

  • All these solutions lead to a redundant step where the database is reloaded for each query, and to a separate output being generated for each query

  • I have had experiences where parallel silenced some errors on some configurations

  • It would be more straightforward to be able to do it directly, especially since some users will instinctively create a signature containing multiple samples

I don't know if this helps, but here is my default implementation of multiprocessing with python:

import sys
from multiprocessing import Pool, cpu_count

class FunArgs:
    """Bundle a function with its arguments so the call can be run later."""

    def __init__(self, fun, *args, **kwargs):
        self.fun = fun
        self.args = args
        self.kwargs = kwargs

    def launch_fun(self):
        return self.fun(*self.args, **self.kwargs)

class Multiprocess:

    @staticmethod
    def lambda_fun(farg):
        # Top-level (picklable) wrapper executed in the worker processes.
        return farg.launch_fun()

    def run(self, fargs, ncore=1, quiet=False):
        if not quiet:
            print("ncore : %i - available : %i" % (ncore, cpu_count()))
        if ncore == 1:
            # Sequential fallback: no process pool needed.
            return [farg.launch_fun() for farg in fargs]

        pool = Pool(ncore)
        func = Multiprocess.lambda_fun
        fargs = ((farg,) for farg in fargs)

        try:
            data = pool.starmap(func, fargs)
            pool.close()
            pool.join()
            return data

        except KeyboardInterrupt:
            print("Interrupting child processes")
            pool.terminate()
            sys.exit()

def mproc(fargs, ncore=1, quiet=False):
    mp = Multiprocess()
    return mp.run(fargs, ncore=ncore, quiet=quiet)

Each pending function call is wrapped in a FunArgs instance, and the resulting list is then passed to the mproc function.
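For illustration, a self-contained usage sketch (it re-declares a minimal FunArgs and uses a toy add function; neither the task nor the values are from the original):

```python
class FunArgs:
    """Bundle a function with its positional and keyword arguments."""

    def __init__(self, fun, *args, **kwargs):
        self.fun = fun
        self.args = args
        self.kwargs = kwargs

    def launch_fun(self):
        return self.fun(*self.args, **self.kwargs)

def add(a, b):
    # Toy stand-in for a real per-query task, e.g. one database search.
    return a + b

# One FunArgs per pending call; mproc(fargs, ncore=N) would fan these out
# over a Pool, while ncore=1 simply runs them sequentially, as here:
fargs = [FunArgs(add, i, b=10) for i in range(4)]
results = [farg.launch_fun() for farg in fargs]
print(results)  # [10, 11, 12, 13]
```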

ctb commented 2 years ago

I don't know if my point of view matters in this context

more information is better, and experience reports are always welcome!

but I don't think snakemake should be seen as an option here. As much as I love workflow managers, it's way too heavy for something like this. parallel makes more sense but there are still some aspects I see in favor of having a direct implementation within sourmash:

  • parallel might not be available on all systems

  • All these solutions lead to a redundant step where the database is reloaded for each query, and to a separate output being generated for each query

we are hopefully moving towards a world where most sourmash databases have extremely low loading latency! I think the only really problematic ones out there are the LCA databases.

  • I have had experiences where parallel silenced some errors on some configurations

ok, that's not good!

  • It would be more straightforward to be able to do it directly, especially since some users will instinctively create a signature containing multiple samples

we definitely want to support stuff that users will do :).

ctb commented 1 year ago

hi @jsgounot you might be interested in the discussion over here,

https://github.com/sourmash-bio/sourmash/issues/2458

which I think ends up converging a lot with this one :).

The plugin experiments might be particularly interesting -

Neither of these are fully mature, but that can change rapidly b/c they're plugins and don't need to go through the same release process as sourmash.

So, just a heads up for the moment, but hopefully more coming!

jsgounot commented 1 year ago

Hi @ctb,

that's great news that things are changing this way. I'll try one next time I need it.

Thanks for the update! JS

ctb commented 1 year ago

hi @jsgounot the pyo3_branchwater plugin is getting pretty mature - you might be interested in the manysearch and multisearch commands. In particular, you can do 80k x 80k genome comparisons in under 5 GB RAM in 90 minutes on 64 CPUs with multisearch. It's still got some inconveniences compared to the full sourmash CLI, but it's coming along!
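For reference, a dry-run sketch of how such a plugin invocation might look. The flags and file names are assumptions, not confirmed against the plugin's current CLI, and `echo` prevents actual execution:

```shell
# branchwater plugin commands are exposed via `sourmash scripts`
# (subcommand names from the comment above; arguments are assumed):
echo sourmash scripts multisearch queries.zip gtdb.zip -o results.csv
echo sourmash scripts manysearch queries.zip gtdb.zip -o results.csv
```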