sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
471 stars 80 forks source link

Implications of multiple signatures in a single .sig #1624

Open gustavo-salazar opened 3 years ago

gustavo-salazar commented 3 years ago

Hey again,

So I noticed that the .sig files can either be a single signature, or an array of signatures. What are the implications of including multiple signatures in a single file? particularly interested for the gather command.

Is it the same to run gather once for a file with multiple signatures, than run it for each signature? What about the CSV output? 1 csv includes the results for all the signatures or multiple files are created?

ctb commented 3 years ago

hi @gustavo-salazar,

if you have multiple signatures in the query file, you will need to pick one using the k-mer size/moltype selectors (-k, --dna, etc.). The code in src/sourmash/commands.py will complain as is - see the sourmash_args.load_query_signature(...) function call around line 647.

if you have multiple signatures passed as subjects to search, the "correct" ksize/moltype will be picked out and used for the search.

gustavo-salazar commented 3 years ago

aah, so the idea of multiple signatures in the same file is to have multiple signatures of the same sequence file, and they are in the same file, because they are using different settings k-mer size or moltype.

That's where I was confused I thought I could create a single file that groups signatures of multiple files.

One of our goals is to allow our users to choose multiple files, e.g. all the contig files of a metagenomic sample. And what I plan to do is to create signatures for each of those files, and then run gather for each of those signatures. When I noticed the signature file was an array I thought I would be able to construct a single file and run gather only once; hence this question.

Thanks for your help

ctb commented 3 years ago

aah, so the idea of multiple signatures in the same file is to have multiple signatures of the same sequence file, and they are in the same file, because they are using different settings k-mer size or moltype.

yep, that's the common use case!

although there are lots of other use cases -

That's where I was confused I thought I could create a single file that groups signatures of multiple files.

One of our goals is to allow our users to choose multiple files, e.g. all the contig files of a metagenomic sample. And what I plan to do is to create signatures for each of those files, and then run gather for each of those signatures. When I noticed the signature file was an array I thought I would be able to construct a single file and run gather only once; hence this question.

Yep! Note that something similar was requested in https://github.com/sourmash-bio/sourmash/issues/1198 and is straightforward to do with the Python API.

We don't currently support this via the default gather CLI because until recently there wasn't any way to optimize the gather search given multiple queries - 5 queries would take 5 times as long as 1 query. So it seemed like an unnecessary complication. But, with the addition of prefetch functionality in 4.1.0, we could potentially merge multiple query signatures, do a prefetch with a single pass across the database to pull out all relevant matches, and then run gather on all the results. (This is easy to do with the API.)

You might look into sourmash multigather, although it doesn't offer much in the way of optimizations beyond database loading, and (unless you or someone else tells me it's useful?) it might be deprecated in 5.0, see https://github.com/sourmash-bio/sourmash/issues/1614. Interestingly the above idea for optimization using prefetch to do a single pass is potentially a good reason to keep multigather around 😆 and would require only minimal code changes 🤔 . With picklists and manifests it would not even need to store bulky intermediate results. IIiiiiiinteresting idea... that would be useful in some other projects, too.

As a side note, picklists (https://github.com/sourmash-bio/sourmash/pull/1588/ - see docs) and manifests (https://github.com/sourmash-bio/sourmash/pull/1590) will probably both be in the next release, and this will enable support for rapid and flexible subsetting of large databases and/or results from searching them. Happy to explain the performance tradeoffs more, if and when you're interested.

For an example use case for this functionality, one of my big use cases for picklists is to do taxonomic subsetting of searches, so you'd be able to say, "only search/return results for Firmicutes", and depending on the database type, you'd potentially get a much faster search of just the Firmicute signatures.

Thanks for your help

Great questions, keep on asking! I'll try to keep my answers from being too long and confusing...!

gustavo-salazar commented 3 years ago

Great questions, keep on asking! I'll try to keep my answers from being too long and confusing...!

Your answers are awesome! I just need time and brain power to process them!

My current implementations is literally just calling gather from sourmash.commands🙈 but I can see how simple it is to have more control of it in the example in #1198 I'll investigate more in this direction.

ctb commented 3 years ago

Great questions, keep on asking! I'll try to keep my answers from being too long and confusing...!

Your answers are awesome! I just need time and brain power to process them!

👍

My current implementations is literally just calling gather from sourmash.commands🙈 but I can see how simple it is to have more control of it in the example in #1198 I'll investigate more in this direction.

Right - and I'd like to continue iterating on the API to meet your needs, so that you don't end up with a boatload of custom code that then needs to be refactored to match latest/greatest sourmash code. Very happy to have this be a two way street - you tell us what API you're using and what stability guarantees you'd like, and we'll put them in our tests so that at the very least we know when we're breaking stuff that you are using (and can bump the version numbers appropriately).