moving toward `sourmash taxonomy` for taxonomy reporting and manipulation from `sourmash gather` results

taylorreiter commented 3 years ago

@luizirber and @bluegenes and I have been getting more excited about a sourmash taxonomy command, and potentially tackling pieces of it in a DIB lab hackathon. We had a conversation about this and wanted to summarize the main points here, as well as continue brainstorming.

Goal: command line interface that takes one or multiple sourmash gather csvs and a lineage csv and provides taxonomic rank summarization and downstream formatting for ingestion by popular taxonomy visualization tools.

Relevant Issues:

Relevant Repos:

https://github.com/dib-lab/2018-ncbi-lineages
https://github.com/dib-lab/sourmash_databases
https://github.com/dib-lab/2019-12-12-sourmash_viz
https://github.com/luizirber/2020-cami
https://github.com/dib-lab/sourmash_databases/pull/11 <- build databases from assembly_stats.txt. Alternative to 2018-ncbi-lineages that parses the genbank assembly_stats.txt for the assembly accession and taxon id.
- currently on farm /home/irber/sourmash_databases/outputs/lca/lineages

What needs to be included in sourmash taxonomy?

Command line interface, inputs and outputs: What should the command line interface look like? What should the format of the inputs and outputs be? What functionality should be included?

inputs:
- one or multiple sourmash gather csv files
- one or more lineage csv files (one lineage file per SBT used in gather)
  - column 1: dataset identification in database (e.g. unique identifier from SBT [NOT MD5]; assembly accession)
  - formatting of subsequent columns needs to be standardized, or parsed from standardized column names (e.g. superkingdom)
commands:
- convert/liftover <- convert between taxonomies
  - e.g. GTDB <-> NCBI using GTDB csv map
  - or GTDB version conversion (e.g. r95 -> rs202)
- summarize; similar to lca summarize
  - cami format output <- (from https://github.com/luizirber/2020-cami)
  - krona format output
  - newick format output?
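
To make the summarize idea concrete, here is a rough sketch of rank-level aggregation over gather rows. It is not existing sourmash code: `summarize_at_rank` and `krona_lines` are hypothetical names, the gather columns `name` and `f_unique_weighted` are assumed, and the accession is assumed to be the first whitespace-separated token of the match name. The last helper writes a quantity followed by tab-separated lineage levels, which is the layout Krona's text importer is understood to accept.

```python
import csv
import io
from collections import defaultdict

# Rank order assumed for illustration; real lineage sheets may differ.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def summarize_at_rank(gather_rows, lineages, rank):
    """Sum the gather fraction of every match, grouped by lineage cut at `rank`."""
    depth = RANKS.index(rank) + 1
    totals = defaultdict(float)
    for row in gather_rows:
        # Assumption: the identifier is the first token of the match name.
        ident = row["name"].split(" ")[0]
        lineage = lineages.get(ident)
        if lineage is None:
            continue  # matches without a lineage entry are skipped in this sketch
        totals[tuple(lineage[:depth])] += float(row["f_unique_weighted"])
    return dict(totals)


def krona_lines(totals):
    """Format totals as 'fraction<TAB>level1<TAB>level2...' lines, largest first."""
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ["\t".join([f"{frac:.3f}", *lineage]) for lineage, frac in ordered]


# Tiny made-up example: two gather matches that share a phylum.
gather_csv = io.StringIO(
    "name,f_unique_weighted\n"
    "GCF_1 genome one,0.25\n"
    "GCF_2 genome two,0.25\n"
)
example_lineages = {
    "GCF_1": ("Bacteria", "Proteobacteria"),
    "GCF_2": ("Bacteria", "Proteobacteria"),
}
example_totals = summarize_at_rank(csv.DictReader(gather_csv), example_lineages, "phylum")
```

A CAMI or newick writer would presumably consume the same `totals` dictionary, so the aggregation step could stay independent of the output format.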

Think about for the future

lineage as a manifest packaged with the zip sbt database

ctb commented 3 years ago

I'm on board!

Is the idea that this would be available for sourmash 4.2? https://github.com/dib-lab/sourmash/issues/1481

ctb commented 3 years ago

do you envision that the taxonomy spreadsheet format would be changing much, or would those be independent changes?

taylorreiter commented 3 years ago

This is something we started talking about. The spreadsheet for NCBI currently has fields (synonymous with) accession, taxonid, superkingdom, phylum, class, order, family, genus, species and I think strain. @bluegenes has made her GTDB lineage spreadsheets without the second column taxonid and without strain, I think. @luizirber, @bluegenes and I agreed that the first column should remain as the dataset identification in database (e.g. unique identifier from SBT [NOT MD5]; assembly accession). After that, we like taxonid for NCBI, but that doesn't necessarily fit with GTDB...and it may be handy to have excess columns beyond the lineage. So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open-ended with lots of room for continued brainstorming.

One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.

ctb commented 3 years ago

a few quick thoughts based on thinking while running - feel free to reject

- presumably sourmash tax/taxonomy would be another set of subcommands?
- I like lca classify for single genomes and lca summarize for multiple
- the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.
- column names == better than what we're doing in lca! I think it's fine to be extra special and hardcode ncbi and gtdb taxon names in as things that can be recognized and combined, since they're so ubiquitous.
- combining multiple taxonomies like you say above is a really good use case...

bluegenes commented 3 years ago

re @taylorreiter comments --

So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming. One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.

I think this use case is a strong argument in support of using column names and enabling multiple lineage spreadsheets to be read in separately. As we read in each lineage csv, we would require that certain columns exist, but otherwise be flexible (e.g. - these can contain additional information that we just ignore). This also would provide a set of standardized guidelines for folks to build their own lineage spreadsheets if they need.
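
A minimal sketch of what reading by column name could look like, assuming an identifier column called `ident` (a hypothetical name; the thread only says the first column is the dataset identifier) plus the seven standard rank columns. `load_lineages` is an illustrative helper, not an existing sourmash function. Extra columns such as taxonid or strain are simply ignored, and several sheets are merged into one mapping.

```python
import csv
import io

# Hypothetical required header names, for illustration only.
REQUIRED = ["ident", "superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def load_lineages(handles):
    """Read one or more open lineage CSVs into {ident: (rank names...)}."""
    lineages = {}
    for fp in handles:
        reader = csv.DictReader(fp)
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED if col not in header]
        if missing:
            raise ValueError(f"lineage file is missing columns: {missing}")
        for row in reader:
            # Columns are looked up by name, so position and extras do not matter.
            lineages[row["ident"]] = tuple(row[rank] for rank in REQUIRED[1:])
    return lineages


# One NCBI-style sheet (with taxid and strain) and one GTDB-style sheet (without).
ncbi_sheet = io.StringIO(
    "ident,taxid,superkingdom,phylum,class,order,family,genus,species,strain\n"
    "GCA_1,123,Eukaryota,p1,c1,o1,f1,g1,s1,st1\n"
)
gtdb_sheet = io.StringIO(
    "ident,superkingdom,phylum,class,order,family,genus,species\n"
    "GCF_2,d__Bacteria,p2,c2,o2,f2,g2,s2\n"
)
combined = load_lineages([ncbi_sheet, gtdb_sheet])
```

In this sketch a later sheet silently overwrites an earlier one when identifiers collide; whether that should be an error instead is an open design question.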

re: @ctb comments --

presumably sourmash tax/taxonomy would be another set of subcommands?

yep!

I like lca classify for single genomes and lca summarize for multiple

yes! Though I think we were talking about dropping the lca -- so sourmash tax classify or sourmash tax summarize just to keep things succinct.

We somewhat decided to focus on the summarize function first, just to narrow the scope. classify is pretty similar to summarize (with a few important diffs), so if we don't get to it now, I will work on it after the hackathon.

the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.

I think what you're saying is +1 to enabling multiple query/db input in a single command? Agree, and I'll also drop in that a --from-file style input would be very useful, at least for classify, which needs to read in the output of gather run on each genome to be classified.
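
For illustration, a `--from-file` option could sit alongside positional inputs roughly like this. This is a standalone argparse sketch under assumed names (`collect_gather_csvs`, the `gather_csv` positional, and the program name are all hypothetical), not sourmash's actual argument handling.

```python
import argparse


def collect_gather_csvs(argv):
    """Return gather CSV paths given on the command line and/or listed in a file."""
    parser = argparse.ArgumentParser(prog="tax-sketch")  # hypothetical program name
    parser.add_argument("gather_csv", nargs="*", help="gather CSV file(s)")
    parser.add_argument("--from-file", help="text file with one gather CSV path per line")
    args = parser.parse_args(argv)

    paths = list(args.gather_csv)
    if args.from_file:
        with open(args.from_file) as fp:
            # Blank lines are skipped so trailing newlines are harmless.
            paths.extend(line.strip() for line in fp if line.strip())
    if not paths:
        parser.error("no gather CSVs given")
    return paths
```

Accepting both forms keeps the single-query case short while still scaling to the one-gather-CSV-per-genome case that classify would need.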

ctb commented 3 years ago

re @taylorreiter comments --

So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming. One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.

I think this use case is a strong argument in support of using column names and enabling multiple lineage spreadsheets to be read in separately. As we read in each lineage csv, we would require that certain columns exist, but otherwise be flexible (e.g. - these can contain additional information that we just ignore). This also would provide a set of standardized guidelines for folks to build their own lineage spreadsheets if they need.

good!

re: @ctb comments --

I like lca classify for single genomes and lca summarize for multiple

yes! Though I think we were talking about dropping the lca -- so sourmash tax classify or sourmash tax summarize just to keep things succinct.

absolutely, especially since we won't be using LCA methods in the same way :)

note that (b/c of semantic versioning) we won't be removing the lca commands completely until v6 at the earliest. But we can deprecate them for v5.

We somewhat decided to focus on the summarize function first, just to narrow the scope. classify is pretty similar to summarize (with a few important diffs), so if we don't get to it now, I will work on it after the hackathon.

k!

the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.

I think what you're saying is +1 to enabling multiple query/db input in a single command? Agree, and I'll also drop in that a --from-file style input would be very useful, at least for classify, which needs to read in the output of gather run on each genome to be classified.

yep!

bluegenes commented 3 years ago

Note - when processing lineages, we should try to ignore assembly version info (.[12]) -- these shouldn't matter for taxonomy, and make things a bit more complicated when an updated assembly is added or an assembly version is redacted (as is currently the case for one in gtdb-rs202)
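
One possible way to do this, sketched with a hypothetical helper `normalize_ident`: take the first token of the name and drop a trailing `.N` version suffix. It generalizes the `.[12]` above to any run of digits, which is an assumption.

```python
import re

# Trailing ".<digits>" is treated as an assembly version (assumption).
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def normalize_ident(name):
    """Return the accession without its version, e.g. for lineage lookups."""
    ident = name.split(" ")[0]  # assumed: accession is the first token
    return _VERSION_SUFFIX.sub("", ident)
```

Both the lineage sheet keys and the gather match names would need to pass through the same function for lookups to line up.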

ctb commented 3 years ago

Note - when processing lineages, we should try to ignore assembly version info (.[12]) -- these shouldn't matter for taxonomy, and make things a bit more complicated when an updated assembly is added or an assembly version is redacted (as is currently the case for one in gtdb-rs202)

running into exactly this with sourmash lca index! working on some patches here, https://github.com/dib-lab/sourmash/pull/1542, would be nice to fix this up front in sourmash taxonomy code!

bluegenes commented 3 years ago

Organization question, mainly for @ctb, but also everyone:

How do we want to split functionality between lca and tax? Since lca already has excellent handy lineage utilities, a simple way to structure the division would be to keep all lineage parsing / manipulation over in lca_utils, and keep tax functions focused around a) parsing and summarizing gather output (using lca functions internally) and b) conversion to output formats for use in visualization.

thoughts?

ctb commented 3 years ago

Organization question, mainly for @ctb, but also everyone:

How do we want to split functionality between lca and tax? Since lca already has excellent handy lineage utilities, a simple way to structure the division would be to keep all lineage parsing / manipulation over in lca_utils, and keep tax functions focused around a) parsing and summarizing gather output (using lca functions internally) and b) conversion to output formats for use in visualization.

thoughts?

suggest converse - copy or move functions over to tax, and change lca to reference them (if unchanged) or not (if changed).

To my understanding, lca will become a special case and/or be deprecated in the future, and I think all taxonomy stuff should be moved under tax.

It might complicate things during the hackathon, tho, so it's totally OK to just leave things as they are and reference the lca functions from tax; this also gives you the flexibility to customize the tax functions where needed. we can move them over later, as long as everything is tested (similar to what we are doing with sourmash compute vs sourmash sketch).

bluegenes commented 3 years ago

suggest converse - copy or move functions over to tax, and change lca to reference them (if unchanged) or not (if changed).

To my understanding, lca will become a special case and/or be deprecated in the future, and I think all taxonomy stuff should be moved under tax.

wonderful. I didn't want to suggest this because I was worried about backwards compatibility, but ofc, can just reference the functions in lca!

It might complicate things during the hackathon, tho, so it's totally OK to just leave things as they are and reference the lca functions from tax; this also gives you the flexibility to customize the tax functions where needed. we can move them over later, as long as everything is tested (similar to what we are doing with sourmash compute vs sourmash sketch).

good point. Will start with copying over the functions we use directly, to allow modification as needed during the hackathon.

bluegenes commented 3 years ago

ref https://github.com/dib-lab/charcoal/issues/174:

It's probably a good idea for gather_at_rank to detect and handle/report such ties, and probably pull the taxonomic assignment up to the level above the tie.
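
A sketch of one way tie handling could work, assuming per-lineage fractions have already been computed. `assign_with_ties` is a hypothetical function, not charcoal's `gather_at_rank`: it looks for a single best lineage at the requested depth and, if the top two are tied, retries one rank higher while recording where the ties occurred.

```python
from collections import defaultdict


def assign_with_ties(lineage_fractions, start_depth):
    """Return (lineage_prefix, fraction, tied_depths).

    lineage_fractions maps full lineage tuples to fractions; start_depth is
    the number of ranks to keep for the most specific attempt.
    """
    tied_depths = []
    if not lineage_fractions:
        return (), 0.0, tied_depths
    for depth in range(start_depth, 0, -1):
        totals = defaultdict(float)
        for lineage, frac in lineage_fractions.items():
            totals[lineage[:depth]] += frac
        ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
        # Exact comparison for simplicity; real code may want a tolerance.
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0], ranked[0][1], tied_depths
        tied_depths.append(depth)  # report the tie, then move up one rank
    return (), 0.0, tied_depths  # tied all the way to the top rank
```

Returning the tied depths lets the caller report the tie rather than silently choosing one of the candidates.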

ctb commented 3 years ago

good point. Will start with copying over the functions we use directly, to allow modification as needed during the hackathon.

maybe: copy on write?

import from lca until you need to change, then when you need to change, copy.

e.g. the tree/LCA stuff is unlikely to need changes, but the taxonomy loading stuff is ...questionable :)

bluegenes commented 3 years ago

preserving from slack

bluegenes: is there any need for a sourmash tax label (or similar), where we just add lineage information into the gather results, with no summarization at all?

titus: I kinda like that! you could imagine something like describe or display that would give you something human readable, OR just have it be straight up CSV output

bluegenes: ooh, definitely this is also making me think of some reporting folks might like out of classify -- x% of genomes classified at species, etc

titus: yep
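
A rough sketch of the label idea: pass each gather row through unchanged and add one lineage column, with no summarization. `label_gather_rows` is a hypothetical name, and the semicolon-joined `lineage` column and the `name` lookup are assumptions made for illustration.

```python
def label_gather_rows(gather_rows, lineages):
    """Yield each gather row with an added 'lineage' column."""
    for row in gather_rows:
        ident = row["name"].split(" ")[0]  # assumed: accession is the first token
        lineage = lineages.get(ident, ())  # empty lineage when the match is unknown
        yield {**row, "lineage": ";".join(lineage)}
```

The rows could then go straight to `csv.DictWriter` for CSV output, or to a pretty-printer for a human-readable display.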

ctb commented 3 years ago

I think everything in here is covered by #1543! At this point someone(s) should revisit https://github.com/sourmash-bio/sourmash/issues/969 and create a new "summary" issue that contains the remaining ideas, but no urgency.

sourmash-bio / sourmash

moving toward `sourmash taxonomy` for taxonomy reporting and manipulation from `sourmash gather` results #1515