sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

provide taxonomy operations that work on semicolon-separated lineages #2185

Closed ctb closed 1 year ago

ctb commented 2 years ago

when we use sourmash tax annotate on gather results, we produce a column with semicolon-separated lineages in it. we don't have many (any?) sourmash subcommands that natively ingest that format, although we do have some parsing code here https://github.com/sourmash-bio/sourmash/issues/2041 for metacoder.

might be nice to think about tooling that easily interconverts between semicolon separated lineages and comma separated lineages, or something.
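As a rough illustration of the interconversion idea above (not the implementation that later landed in sourmash), here is a minimal Python sketch that splits a semicolon-separated lineage into one column per rank and joins it back. The rank list matches the ranks shown in the summarize output later in this thread; the column names `name`, `lineage`, and `ident`, and the helper names `split_lineage`, `join_lineage`, and `expand_lineage_column`, are assumptions made for this example.

```python
import csv
import io

# Ranks in order, as listed in the `tax summarize` output in this thread.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def split_lineage(lineage):
    """Turn 'a;b;c' into a dict keyed by rank; missing ranks become ''."""
    names = [n.strip() for n in lineage.split(";")] if lineage else []
    names = names[: len(RANKS)]                 # ignore anything below species
    names += [""] * (len(RANKS) - len(names))   # pad ranks that are absent
    return dict(zip(RANKS, names))


def join_lineage(row):
    """Turn a per-rank dict back into 'a;b;c', dropping trailing empty ranks."""
    names = [row.get(rank, "") for rank in RANKS]
    while names and not names[-1]:
        names.pop()
    return ";".join(names)


def expand_lineage_column(in_fp, out_fp, ident_col="name", lineage_col="lineage"):
    """Read a CSV with a semicolon-separated lineage column and write a
    one-column-per-rank CSV. The column names are hypothetical defaults."""
    reader = csv.DictReader(in_fp)
    writer = csv.DictWriter(out_fp, fieldnames=["ident"] + RANKS)
    writer.writeheader()
    for row in reader:
        out = {"ident": row[ident_col]}
        out.update(split_lineage(row[lineage_col]))
        writer.writerow(out)


if __name__ == "__main__":
    # Tiny in-memory demo with made-up data.
    src = io.StringIO("name,lineage\ngenomeA,d__Bacteria;p__Proteobacteria\n")
    dst = io.StringIO()
    expand_lineage_column(src, dst)
    print(dst.getvalue())
```

Padding missing ranks with empty strings keeps every output row the same width, so the result stays a regular CSV even when a lineage stops above species.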

ctb commented 2 years ago

(this came up during the discussion of tax grep over in https://github.com/sourmash-bio/sourmash/pull/2178#issuecomment-1206647255, and also seems relevant to some of the bigger select-on-metadata ideas out there e.g. https://github.com/sourmash-bio/sourmash/issues/2180)

ctb commented 1 year ago

Implemented in #2333 - so, for example, the new summarize command would print out:

% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom:        2 distinct identifiers
rank phylum:              25 distinct identifiers
rank class:               32 distinct identifiers
rank order:               42 distinct identifiers
rank family:              52 distinct identifiers
rank genus:               60 distinct identifiers
rank species:             84 distinct identifiers

and

% sourmash tax prepare -t SRR606249-k31.x.gtdb.gather.with-lineages.csv -o zzz.csv -F csv

works as well.

ctb commented 1 year ago

semicolon-separated lineages and gather with-lineages output are now natively supported as taxonomy spreadsheets and can be used with all tax commands, per https://github.com/sourmash-bio/sourmash/pull/2333 🎉