Closed: ctb closed this 1 year ago
(this came up during the discussion of tax grep over in https://github.com/sourmash-bio/sourmash/pull/2178#issuecomment-1206647255, and also seems relevant to some of the bigger select-on-metadata ideas out there, e.g. https://github.com/sourmash-bio/sourmash/issues/2180)
Implemented in #2333 - so, for example, the new summarize command would print out:
% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom: 2 distinct identifiers
rank phylum: 25 distinct identifiers
rank class: 32 distinct identifiers
rank order: 42 distinct identifiers
rank family: 52 distinct identifiers
rank genus: 60 distinct identifiers
rank species: 84 distinct identifiers
and
% sourmash tax prepare -t SRR606249-k31.x.gtdb.gather.with-lineages.csv -o zzz.csv -F csv
works as well.
semicolon-separated lineages and gather with-lineages output are now natively supported as taxonomy spreadsheets and can be used with all tax commands, per https://github.com/sourmash-bio/sourmash/pull/2333 🎉
when we use sourmash tax annotate on gather results, we produce a column with semicolon-separated lineages in it. we don't have many (any?) sourmash subcommands that natively ingest that format, although we do have some parsing code here https://github.com/sourmash-bio/sourmash/issues/2041 for metacoder.
might be nice to think about tooling that easily interconverts between semicolon-separated lineages and comma-separated lineages, or something.
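to make the interconversion idea concrete, here is a rough standalone sketch in plain Python. it is not sourmash's implementation and does not use any sourmash API: the function names (split_lineage, join_lineage, to_csv), the ident column name, and the example lineage strings are all hypothetical, and the rank list is just the seven ranks that tax summarize reports.

```python
import csv
import io

# Ranks as reported by `sourmash tax summarize`; the fixed order is an assumption.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def split_lineage(lineage, ranks=RANKS):
    """Split 'a;b;c' into one value per rank, padding missing ranks with ''."""
    names = [n.strip() for n in lineage.split(";")] if lineage else []
    if len(names) > len(ranks):
        raise ValueError(f"lineage has more than {len(ranks)} ranks: {lineage!r}")
    names += [""] * (len(ranks) - len(names))
    return dict(zip(ranks, names))


def join_lineage(row, ranks=RANKS):
    """Inverse of split_lineage: join rank columns, dropping trailing empty ranks."""
    names = [row.get(r, "").strip() for r in ranks]
    while names and not names[-1]:
        names.pop()
    return ";".join(names)


def to_csv(pairs, ranks=RANKS):
    """Render (ident, lineage) pairs as CSV text with one column per rank."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["ident"] + list(ranks), lineterminator="\n")
    writer.writeheader()
    for ident, lineage in pairs:
        writer.writerow({"ident": ident, **split_lineage(lineage, ranks)})
    return out.getvalue()


if __name__ == "__main__":
    # Made-up identifier and GTDB-style lineage, for illustration only.
    print(to_csv([("example_ident_1", "d__Bacteria;p__Bacteroidota;c__Bacteroidia")]))
```

padding short lineages with empty strings keeps every row the same width, and join_lineage drops those trailing empties again, so a round trip through both functions returns the original string.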