Open jsgounot opened 2 years ago
Hi,

I would like to compare multiple genomes (stored in a single signature file) against the GTDB database with a minimum similarity threshold. `search` looks limited to a single query genome at a time, and `compare` forces the creation of the full matrix, which is memory intensive and not necessary for most usages. Did I miss an obvious way to do this?

Regards, JS
hi @jsgounot this isn't currently possible with the command line, I'm afraid.
If you were using a shell script or a workflow system, you could loop through all the md5sums of the query signatures in a file query.zip with `sourmash search --md5 <query md5> query.zip database.zip`.
Alternatively, you could use `sourmash sig split` to break the query signatures into many individual files.
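For concreteness, a rough sketch of that loop - the `md5:` line parsing is an assumption about the text output of `sourmash sig describe`, so double-check it against your version:

```bash
# run one search per query signature in query.zip, one results CSV per query
for md5 in $(sourmash sig describe query.zip | awk '/^md5:/ {print $2}'); do
    sourmash search --md5 "$md5" query.zip database.zip -o "search.${md5}.csv"
done
```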
hi @ctb, thanks for the quick answer. Noted. Any chance of seeing this in the future? To be honest, querying a list of bins against various databases was 95% of my usage of mash, and implementing this would greatly improve my usage of sourmash.
thinking out loud -
We do also support multiple queries in the `sourmash lca` subcommands with `sourmash lca classify --query <input.sig>`, and that seems like a straightforward approach here (and it's how we implemented multigather, too).
I also wonder if we could better support the kind of only mildly complicated one-liner with `parallel` used in https://github.com/sourmash-bio/sourmash/issues/2066? That would resolve the speed / parallelization concerns above.
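Something along these lines is the kind of one-liner I mean - a rough sketch, not the exact command from #2066, with illustrative file names:

```bash
# split the multi-signature file into individual .sig files,
# then fan the searches out with GNU parallel, one results CSV per query
sourmash sig split query.zip
ls *.sig | parallel 'sourmash search {} database.zip -o {.}.csv'
```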
Also, I'll look at the mash approach and see if we can learn from their style ;)
I don't know if my point of view matters in this context, but I don't think snakemake should be seen as an option here. As much as I love workflow managers, it's way too heavy for something like this. `parallel` makes more sense, but there are still some aspects I see in favor of having a direct implementation within sourmash:

- `parallel` might not be available on all systems
- All these solutions lead to a redundant step where the database is reloaded for each query, and the generation of one output per query
- I have experiences where parallel silenced some errors on some configurations
- It would be more straightforward to be able to do it, especially since some users will instinctively create a signature containing multiple samples

I don't know if this helps, but here is my default implementation of multiprocessing with Python:
```python
import sys
from multiprocessing import Pool, cpu_count


class FunArgs:
    """Bundle a function with its positional and keyword arguments."""

    def __init__(self, fun, *args, **kwargs):
        self.fun = fun
        self.args = args
        self.kwargs = kwargs

    def launch_fun(self):
        return self.fun(*self.args, **self.kwargs)


class Multiprocess:

    @staticmethod
    def lambda_fun(farg):
        # staticmethod so the callable can be pickled and shipped to workers
        return farg.launch_fun()

    def run(self, fargs, ncore=1, quiet=False):
        if not quiet:
            print("ncore : %i - available : %i" % (ncore, cpu_count()))

        # with a single core, skip the pool entirely
        if ncore == 1:
            return [farg.launch_fun() for farg in fargs]

        pool = Pool(ncore)
        func = Multiprocess.lambda_fun
        fargs = ((farg,) for farg in fargs)

        try:
            data = pool.starmap(func, fargs)
            pool.close()
            pool.join()
            return data

        except KeyboardInterrupt:
            print("Interrupting child processes")
            pool.terminate()
            sys.exit()


def mproc(fargs, ncore=1, quiet=False):
    # e.g. mproc([FunArgs(my_fun, x, k=2) for x in inputs], ncore=4)
    mp = Multiprocess()
    return mp.run(fargs, ncore=ncore, quiet=quiet)
```
Each function call is wrapped in a `FunArgs` instance, and the resulting list is then given to the `mproc` function.
> I don't know if my point of view matters in this context

more information is better, and experience is always welcome!

> but I don't think snakemake should be seen as an option here. As much as I love workflow managers, it's way too heavy for something like this. `parallel` makes more sense, but there are still some aspects I see in favor of having a direct implementation within sourmash:
> - `parallel` might not be available on all systems
> - All these solutions lead to a redundant step where the database is reloaded for each query, and the generation of one output per query
we are hopefully moving towards a world where most sourmash databases have extremely low loading latency! I think the only really problematic ones out there are the LCA databases.
> - I have experiences where parallel silenced some errors on some configurations
ok, that's not good!
> - It would be more straightforward to be able to do it, especially since some users will instinctively create a signature containing multiple samples
we definitely want to support stuff that users will do :).
hi @jsgounot you might be interested in the discussion over here, https://github.com/sourmash-bio/sourmash/issues/2458, which I think ends up converging a lot with this one :).
The plugin experiments might be particularly interesting - neither of these is fully mature, but that can change rapidly b/c they're plugins and don't need to go through the same release process as sourmash. So, just a heads up for the moment, but hopefully more coming!
Hi @ctb,
that's great news that things are changing this way. I'll try one next time I need it.
Thanks for the update! JS
hi @jsgounot the pyo3_branchwater plugin is getting pretty mature - you might be interested in the `manysearch` and `multisearch` commands. In particular, you can do 80k x 80k genome comparisons in under 5 GB RAM in 90 minutes on 64 CPUs with `multisearch`. It's still got some inconveniences compared to the full sourmash CLI, but it's coming along!
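For anyone finding this thread later, a rough sketch of what that might look like (plugin commands run under `sourmash scripts`; the output flag is an assumption, so check the plugin docs for your version):

```bash
# install the plugin, then compare every query against every database entry;
# -o as the results-CSV flag is an assumption about the plugin CLI
pip install pyo3_branchwater
sourmash scripts multisearch query.zip database.zip -o results.csv
```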