sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

tax in zip files; supporting multiple taxonomies with command-line switches #2216

Open ctb opened 2 years ago

ctb commented 2 years ago

In https://github.com/sourmash-bio/sourmash/issues/2154, we've been talking about how to include taxonomic information in zipfiles, and I've been trying to figure out how that would work at the command line.

But all the discussion happened in a now-closed issue and a now-merged PR ;). So here's a new issue!

Comments copied over from various other issues and PRs -

From https://github.com/sourmash-bio/sourmash/pull/2195#issuecomment-1213275883, @bluegenes:

As discussed on slack :) -

re: SOURMASH-TAXONOMY — would you consider GTDB-TAXONOMY and NCBI-TAXONOMY instead, with the default being gtdb?

OR, somewhere in database info/metadata (which we don’t have yet, but have talked about), add the default for that database? In this case, I'm thinking about database info/metadata as database version (e.g. gtdb-rs207), sourmash signature version, creation date, etc -- and then adding default-taxonomy.
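One way the default-taxonomy idea might look, as a rough sketch only: sourmash has no database metadata format yet, so the JSON layout, the key names, and the `resolve_taxonomy` helper below are all hypothetical, and the creation date is a made-up placeholder. The only things taken from the comment above are the kinds of fields (database version, creation date, default taxonomy) and gtdb as the overall default.

```python
import json

# Hypothetical metadata record; sourmash does not define this format.
# The creation date is a placeholder, not a real release date.
EXAMPLE_METADATA = json.dumps({
    "database-version": "gtdb-rs207",
    "creation-date": "2022-01-01",
    "default-taxonomy": "ncbi",
})

# Proposed global default when nothing else is specified.
FALLBACK_TAXONOMY = "gtdb"


def resolve_taxonomy(metadata_json, requested=None):
    """Choose a taxonomy name.

    Precedence: explicit user request, then the database's own
    default-taxonomy entry, then the global fallback.
    """
    if requested is not None:
        return requested
    metadata = json.loads(metadata_json) if metadata_json else {}
    return metadata.get("default-taxonomy", FALLBACK_TAXONOMY)
```

With this precedence, a database could ship with its own preferred taxonomy while a command-line switch still overrides it.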

From https://github.com/sourmash-bio/sourmash/issues/2012#issuecomment-1214419578, I wrote:

Trying to figure out how distributing multiple taxonomies in a zip file would work at the command line.

The most obvious idea is:

sourmash tax classify -g gather.csv -t gtdb-xyz.zip --gtdb

which would load GTDB-TAXONOMY.csv from gtdb-xyz.zip, vs

sourmash tax classify -g gather.csv -t gtdb-xyz.zip --ncbi

which would load NCBI-TAXONOMY.csv from gtdb-xyz.zip.

Then we could potentially add --lins later on for https://github.com/sourmash-bio/sourmash/issues/1813.

Alternative command-line switches would be --tax-type ncbi or something, but I feel like --ncbi and --gtdb are probably simplest and easiest to remember.


which received @bluegenes' endorsement:

I like --gtdb and --ncbi, especially since I can't see us integrating so many taxonomies that having an argument per tax type would be unwieldy.

--lins definitely useful when we get there!

Also sorta connects with https://github.com/sourmash-bio/sourmash/issues/2186, searching/selecting on taxonomic lineages?
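To make the --gtdb / --ncbi proposal concrete, here is a minimal standalone sketch using only the Python standard library. It is not sourmash code: the parser, the `load_taxonomy` helper, and the `tax_type` attribute are illustrative assumptions, and the member names GTDB-TAXONOMY.csv and NCBI-TAXONOMY.csv are the ones proposed in this thread rather than an existing convention.

```python
import argparse
import csv
import io
import zipfile

# Member names as proposed in this thread (not an existing sourmash convention).
TAXONOMY_MEMBERS = {
    "gtdb": "GTDB-TAXONOMY.csv",
    "ncbi": "NCBI-TAXONOMY.csv",
}


def build_parser():
    """Build a toy parser mirroring the proposed command line."""
    parser = argparse.ArgumentParser(prog="tax-classify-sketch")
    parser.add_argument("-g", "--gather-csv", required=True)
    parser.add_argument("-t", "--taxonomy", required=True)
    # --gtdb and --ncbi write to the same destination and exclude each other;
    # gtdb is the default when neither switch is given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--gtdb", dest="tax_type", action="store_const",
                       const="gtdb", default="gtdb")
    group.add_argument("--ncbi", dest="tax_type", action="store_const",
                       const="ncbi", default="gtdb")
    return parser


def load_taxonomy(zip_path, tax_type):
    """Read the taxonomy CSV for tax_type out of the zip as a list of dicts."""
    member = TAXONOMY_MEMBERS[tax_type]
    with zipfile.ZipFile(zip_path) as zf:
        # zf.open raises KeyError if the zip has no such member.
        with zf.open(member) as fp:
            text = io.TextIOWrapper(fp, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

A later --lins switch would only need another entry in the mapping and another `store_const` argument in the same group.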

ctb commented 2 years ago

found this comment from @bluegenes, buried in a different issue - it appears to be the original we-should-have-tax-in-zip idea -

Additional thought: It would be handy to include the taxonomy file inside each database file (possible with zip, sbt.zip, and sqldb and not needed for lca, right?). That would reduce extra download code and the need to link the correct taxonomy file with each database. For taxonomy functions with official databases, users could provide the database on the command line (instead of needing to find/download the taxonomy file), and we could automatically find it. I would imagine TAXONOMY.csv, complementary to manifest file. We would still allow alternate taxonomies, of course, but at least each db would come with the official set for that db?
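As a rough illustration of that idea, the sketch below appends a taxonomy file to an existing database zip under the name TAXONOMY.csv and then looks it up automatically, while still letting a user-supplied alternate taxonomy take precedence. The helper names `bundle_taxonomy` and `read_taxonomy_text` are hypothetical, and this uses only Python's `zipfile` module rather than any sourmash API.

```python
import zipfile

# Member name suggested in the comment above.
TAXONOMY_MEMBER = "TAXONOMY.csv"


def bundle_taxonomy(db_zip, taxonomy_csv):
    """Append a taxonomy CSV to an existing database zip."""
    with zipfile.ZipFile(db_zip, "a") as zf:
        if TAXONOMY_MEMBER in zf.namelist():
            raise ValueError(f"{db_zip} already contains {TAXONOMY_MEMBER}")
        zf.write(taxonomy_csv, arcname=TAXONOMY_MEMBER)


def read_taxonomy_text(db_zip, override_csv=None):
    """Return taxonomy CSV text, preferring a user-supplied alternate file."""
    if override_csv is not None:
        with open(override_csv, encoding="utf-8") as fp:
            return fp.read()
    with zipfile.ZipFile(db_zip) as zf:
        if TAXONOMY_MEMBER not in zf.namelist():
            raise FileNotFoundError(
                f"no {TAXONOMY_MEMBER} in {db_zip}; pass a taxonomy file"
            )
        return zf.read(TAXONOMY_MEMBER).decode("utf-8")
```

Opening the zip in append mode leaves the existing sketches and manifest untouched, so the taxonomy can be added to an already-built database.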

ctb commented 1 year ago

also ref https://github.com/nf-core/taxprofiler/pull/404, where it would clearly be nice to have just one file containing sketches + taxonomy CSV.