sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Search a database with multiple genomes #2069

Open jsgounot opened 2 years ago

jsgounot commented 2 years ago

Hi,

I would like to compare multiple genomes (stored in a single signature file) against the GTDB database with a minimum similarity threshold.

Did I miss an obvious way to do it?

Regards, JS

ctb commented 2 years ago

hi @jsgounot this isn't currently possible with the command line, I'm afraid.

If you were using a shell script or a workflow system, you could loop over the md5sums of the query signatures in a file query.zip, running sourmash search --md5 <query md5> query.zip database.zip for each.

Alternatively, you could use sourmash sig split to break the query signatures into many individual files.
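A minimal shell sketch of both workarounds. The md5 values and file names below are placeholders, and `echo` keeps this a dry run (drop the `echo` to actually execute the searches):

```shell
set -eu
# Workaround 1: search each query record by md5sum. In a real run the md5s
# would come from the query collection (e.g. via `sourmash sig describe`);
# the values below are placeholders.
for md5 in aaa111 bbb222; do
    echo sourmash search --md5 "$md5" query.zip database.zip
done

# Workaround 2: split the collection into per-record files first
# (`sourmash sig split query.zip`), then search each file. The file names
# here are placeholders for whatever sig split produces:
for sig in query_a.sig query_b.sig; do
    echo sourmash search "$sig" database.zip
done
```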

jsgounot commented 2 years ago

hi @ctb, thanks for the quick answer. Noted. Any chance of seeing this in the future? To be honest, querying a list of bins against various databases was 95% of my usage of mash, and implementing this would greatly improve my usage of sourmash.

ctb commented 2 years ago

thinking out loud -

We do also support multiple queries in the sourmash lca subcommands with sourmash lca classify --query <input.sig> and that seems like a straightforward approach here (and it's how we implemented multigather, too).

I also wonder if we could better support the kind of only mildly complicated one-liner with parallel used in https://github.com/sourmash-bio/sourmash/issues/2066? That would resolve the speed/parallelization concerns above.
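That one-liner pattern can also be approximated with the more widely available `xargs -P`; a dry-run sketch (file names assumed, `echo` standing in for actually invoking sourmash):

```shell
# Feed per-query signature files to up to 4 parallel workers.
# With GNU parallel the equivalent would be roughly:
#   ls split/*.sig | parallel -j 4 sourmash search {} database.zip
printf '%s\n' q1.sig q2.sig q3.sig \
    | xargs -P 4 -I{} echo sourmash search {} database.zip
```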

Also, I'll look at the mash approach and see if we can learn from their style ;)

jsgounot commented 2 years ago

I don't know if my point of view matters in this context but I don't think snakemake should be seen as an option here. As much as I love workflow managers, it's way too heavy for something like this. parallel makes more sense, but there are still some aspects I see in favor of having a direct implementation within sourmash:

  • parallel might not be available on all systems

  • All these solutions lead to a redundant step where the database is reloaded for each query, and to a separate output being generated for each query

  • I have had experiences where parallel silenced some errors on some configurations

  • It would be more straightforward to be able to do it directly, especially since some users will instinctively create a signature containing multiple samples

I don't know if this helps, but here is my default implementation of multiprocessing with python:

import sys
from multiprocessing import Pool, cpu_count

class FunArgs:
    """Bundle a function with its arguments so the call can be run later."""

    def __init__(self, fun, *args, **kwargs):
        self.fun = fun
        self.args = args
        self.kwargs = kwargs

    def launch_fun(self):
        return self.fun(*self.args, **self.kwargs)

class Multiprocess:

    @staticmethod
    def lambda_fun(farg):
        # Top-level (picklable) wrapper executed in the worker processes.
        return farg.launch_fun()

    def run(self, fargs, ncore=1, quiet=False):
        if not quiet:
            print("ncore : %i - available : %i" % (ncore, cpu_count()))
        if ncore == 1:
            # Sequential fallback: no process pool needed.
            return [farg.launch_fun() for farg in fargs]

        pool = Pool(ncore)
        func = Multiprocess.lambda_fun
        fargs = ((farg,) for farg in fargs)

        try:
            data = pool.starmap(func, fargs)
            pool.close()
            pool.join()
            return data

        except KeyboardInterrupt:
            print("Interrupting child processes")
            pool.terminate()
            sys.exit()

def mproc(fargs, ncore=1, quiet=False):
    mp = Multiprocess()
    return mp.run(fargs, ncore=ncore, quiet=quiet)

Each pending function call is wrapped in a FunArgs instance, and the resulting list is then passed to the mproc function.
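For illustration, a self-contained usage sketch (it re-declares a minimal FunArgs and uses a toy add function; neither the task nor the values are from the original):

```python
class FunArgs:
    """Bundle a function with its positional and keyword arguments."""

    def __init__(self, fun, *args, **kwargs):
        self.fun = fun
        self.args = args
        self.kwargs = kwargs

    def launch_fun(self):
        return self.fun(*self.args, **self.kwargs)

def add(a, b):
    # Toy stand-in for a real per-query task, e.g. one database search.
    return a + b

# One FunArgs per pending call; mproc(fargs, ncore=N) would fan these out
# over a Pool, while ncore=1 simply runs them sequentially, as here:
fargs = [FunArgs(add, i, b=10) for i in range(4)]
results = [farg.launch_fun() for farg in fargs]
print(results)  # [10, 11, 12, 13]
```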

ctb commented 2 years ago

I don't know if my point of view matters in this context

more information is better, and experience reports are always welcome!

but I don't think snakemake should be seen as an option here. As much as I love workflow managers, it's way too heavy for something like this. parallel makes more sense but there are still some aspects I see in favor of having a direct implementation within sourmash:

  • parallel might not be available on all systems

  • All these solutions lead to a redundant step where the database is reloaded for each query, and to a separate output being generated for each query

we are hopefully moving towards a world where most sourmash databases have extremely low loading latency! I think the only really problematic ones out there are the LCA databases.

  • I have had experiences where parallel silenced some errors on some configurations

ok, that's not good!

  • It would be more straightforward to be able to do it directly, especially since some users will instinctively create a signature containing multiple samples

we definitely want to support stuff that users will do :).

ctb commented 1 year ago

hi @jsgounot you might be interested in the discussion over here,

https://github.com/sourmash-bio/sourmash/issues/2458

which I think ends up converging a lot with this one :).

The plugin experiments might be particularly interesting -

Neither of these are fully mature, but that can change rapidly b/c they're plugins and don't need to go through the same release process as sourmash.

So, just a heads up for the moment, but hopefully more coming!

jsgounot commented 1 year ago

Hi @ctb,

that's great news that things are changing this way. I'll try one next time I need it.

Thanks for the update! JS

ctb commented 1 year ago

hi @jsgounot the pyo3_branchwater plugin is getting pretty mature - you might be interested in the manysearch and multisearch commands. In particular, you can do 80k x 80k genome comparisons in under 5 GB RAM in 90 minutes on 64 CPUs with multisearch. It's still got some inconveniences compared to the full sourmash CLI, but it's coming along!
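For reference, a dry-run sketch of how such a plugin invocation might look. The flags and file names are assumptions, not confirmed against the plugin's current CLI, and `echo` prevents actual execution:

```shell
# branchwater plugin commands are exposed via `sourmash scripts`
# (subcommand names from the comment above; arguments are assumed):
echo sourmash scripts multisearch queries.zip gtdb.zip -o results.csv
echo sourmash scripts manysearch queries.zip gtdb.zip -o results.csv
```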