Closed taylorreiter closed 3 years ago
I'm on board!
Is the idea that this would be available for sourmash 4.2? https://github.com/dib-lab/sourmash/issues/1481
do you envision that the taxonomy spreadsheet format would be changing much, or would those be independent changes?
This is something we started talking about. The spreadsheet for NCBI currently has fields (synonymous with) `accession`, `taxonid`, `superkingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species` and, I think, `strain`. @bluegenes has made her GTDB lineage spreadsheets without the second column `taxonid` and, I think, without `strain`. @luizirber, @bluegenes and I agreed that the first column should remain the dataset identifier in the database (e.g. unique identifier from SBT [NOT MD5]; assembly accession). After that, we like `taxonid` for NCBI, but that doesn't necessarily fit with GTDB...and it may be handy to have excess columns over the lineage. So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming.
One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.
a few quick thoughts based on thinking while running - feel free to reject
- presumably sourmash tax/taxonomy would be another set of subcommands?
- I like `lca classify` for single genomes and `lca summarize` for multiple
- the `--query` and `--db` style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.
re @taylorreiter comments --
> So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming. One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.
I think this use case is a strong argument in support of using column names and enabling multiple lineage spreadsheets to be read in separately. As we read in each lineage csv, we would require that certain columns exist, but otherwise be flexible (e.g. - these can contain additional information that we just ignore). This also would provide a set of standardized guidelines for folks to build their own lineage spreadsheets if they need.
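A minimal sketch of that column-name approach, assuming hypothetical column names (`accession` plus the seven rank names listed earlier in the thread). The `load_lineages` helper and the sample rows are made up for illustration and are not sourmash code. Each sheet is read on its own, required columns are checked, and extras such as `taxonid` or `strain` are simply ignored:

```python
import csv
import io

# Assumed column names: an identifier column plus the standard ranks.
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")
REQUIRED = ("accession",) + RANKS


def load_lineages(handles):
    """Read several lineage CSVs by column name and merge them into one dict."""
    lineages = {}
    for handle in handles:
        reader = csv.DictReader(handle)
        missing = [col for col in REQUIRED if col not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"lineage sheet is missing columns: {missing}")
        for row in reader:
            # Only the required columns are used; extra columns are ignored.
            lineages[row["accession"]] = tuple(row[rank] for rank in RANKS)
    return lineages


# Made-up rows: an NCBI-style sheet with taxonid and a GTDB-style sheet without.
ncbi = io.StringIO(
    "accession,taxonid,superkingdom,phylum,class,order,family,genus,species,strain\n"
    "ACC1,1234,Eukaryota,p1,c1,o1,f1,g1,g1 s1,\n"
)
gtdb = io.StringIO(
    "accession,superkingdom,phylum,class,order,family,genus,species\n"
    "ACC2,d__Bacteria,p__P,c__C,o__O,f__F,g__G,s__G sp\n"
)
lineages = load_lineages([ncbi, gtdb])
```

Because both sheets end up in one dict keyed by identifier, a combined GTDB + NCBI run like the one described above would not need the spreadsheets to be concatenated by hand first.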
re: @ctb comments --
> presumably sourmash tax/taxonomy would be another set of subcommands?
yep!
> I like lca classify for single genomes and lca summarize for multiple
yes! Though I think we were talking about dropping the `lca` -- so `sourmash tax classify` or `sourmash tax summarize` just to keep things succinct.
We somewhat decided to focus on the `summarize` function first, just to narrow the scope. `classify` is pretty similar to summarize (with a few important diffs), so if we don't get to it now, I will work on it after the hackathon.
> the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.
I think what you're saying is +1 to enabling multiple query/db input in a single command? Agree, and I'll also drop in that a `--from-file` style input would be very useful, at least for `classify`, which needs to read in the output of gather run on each genome to be classified.
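One way this could look, sketched with `argparse`. Everything here is an assumption for illustration: the `tax-sketch` program name, the `--gather-csv` and `--lineages` flag names, and the `collect_gather_csvs` helper are hypothetical; only `--from-file` and the `summarize`/`classify` subcommand names come from the discussion. Both subcommands take the same multi-value inputs so the style stays consistent:

```python
import argparse


def build_parser():
    """Hypothetical parser; program name and most flag names are assumptions."""
    parser = argparse.ArgumentParser(prog="tax-sketch")
    subcommands = parser.add_subparsers(dest="subcommand", required=True)
    for name in ("summarize", "classify"):
        sub = subcommands.add_parser(name)
        # Same input style for both subcommands, for consistency.
        sub.add_argument("--gather-csv", nargs="*", default=[])
        sub.add_argument("--from-file", default=None)
        sub.add_argument("--lineages", nargs="+", required=True)
    return parser


def collect_gather_csvs(args):
    """Combine paths given on the command line with paths listed in a file."""
    paths = list(args.gather_csv)
    if args.from_file:
        with open(args.from_file) as handle:
            # One path per line; blank lines are skipped.
            paths.extend(line.strip() for line in handle if line.strip())
    if not paths:
        raise ValueError("no gather CSVs given")
    return paths


args = build_parser().parse_args(
    ["summarize", "--gather-csv", "a.csv", "b.csv", "--lineages", "gtdb.csv", "ncbi.csv"]
)
gather_paths = collect_gather_csvs(args)
```

With one input the flags are no clumsier than a positional argument, and with many genomes the file list avoids very long command lines.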
> re @taylorreiter comments --
>> So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming. One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.
> I think this use case is a strong argument in support of using column names and enabling multiple lineage spreadsheets to be read in separately. As we read in each lineage csv, we would require that certain columns exist, but otherwise be flexible (e.g. - these can contain additional information that we just ignore). This also would provide a set of standardized guidelines for folks to build their own lineage spreadsheets if they need.
good!
> re: @ctb comments --
>> I like lca classify for single genomes and lca summarize for multiple
> yes! Though I think we were talking about dropping the `lca` -- so `sourmash tax classify` or `sourmash tax summarize` just to keep things succinct.
absolutely, especially since we won't be using LCA methods in the same way :)
note that (b/c of semantic versioning) we won't be removing the lca commands completely until v6 at the earliest. But we can deprecate them for v5.
> We somewhat decided to focus on the `summarize` function first, just to narrow the scope. `classify` is pretty similar to summarize (with a few important diffs), so if we don't get to it now, I will work on it after the hackathon.
k!
>> the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.
> I think what you're saying is +1 to enabling multiple query/db input in a single command? Agree, and I'll also drop in that a `--from-file` style input would be very useful, at least for `classify`, which needs to read in the output of gather run on each genome to be classified.
yep!
Note - when processing lineages, we should try to ignore assembly version info (`.[12]`) -- these shouldn't matter for taxonomy, and make things a bit more complicated when an updated assembly is added or an assembly version is redacted (as is currently the case for one in `gtdb-rs202`)
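A small sketch of what ignoring the version could look like. The helper names and the accession are made up for illustration, and the pattern strips any trailing `.<digits>` rather than only `.1` or `.2`, which is an assumption beyond what the note says:

```python
import re

# Trailing ".<digits>" version suffix, e.g. ".1" or ".2".
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(ident):
    """Drop a trailing assembly version from an accession-like identifier."""
    return _VERSION_SUFFIX.sub("", ident)


def find_lineage(ident, lineages):
    """Look up a lineage while ignoring version suffixes on both sides."""
    by_base = {strip_version(key): value for key, value in lineages.items()}
    return by_base.get(strip_version(ident))


# Made-up accession: the sheet lists version 1, the query names version 2.
sheet = {"GCA_000000001.1": ("d__Bacteria", "p__P")}
found = find_lineage("GCA_000000001.2", sheet)
```

In real code the version-free index would be built once when the lineage sheet is loaded rather than on every lookup; it is rebuilt per call here only to keep the sketch short.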
> Note - when processing lineages, we should try to ignore assembly version info (`.[12]`) -- these shouldn't matter for taxonomy, and make things a bit more complicated when an updated assembly is added or an assembly version is redacted (as is currently the case for one in `gtdb-rs202`)
running into exactly this with `sourmash lca index`! working on some patches here, https://github.com/dib-lab/sourmash/pull/1542, would be nice to fix this up front in `sourmash taxonomy` code!
Organization question, mainly for @ctb, but also everyone:
How do we want to split functionality between `lca` and `tax`? Since `lca` already has excellent handy lineage utilities, a simple way to structure the division would be to keep all lineage parsing / manipulation over in `lca_utils`, and keep `tax` functions focused around a) parsing and summarizing gather output (using lca functions internally) and b) conversion to output formats for use in visualization.
thoughts?
> Organization question, mainly for @ctb, but also everyone:
> How do we want to split functionality between `lca` and `tax`? Since `lca` already has excellent handy lineage utilities, a simple way to structure the division would be to keep all lineage parsing / manipulation over in `lca_utils`, and keep `tax` functions focused around a) parsing and summarizing gather output (using lca functions internally) and b) conversion to output formats for use in visualization.
> thoughts?
suggest converse - copy or move functions over to `tax`, and change `lca` to reference them (if unchanged) or not (if changed).
To my understanding, `lca` will become a special case and/or be deprecated in the future, and I think all taxonomy stuff should be moved under `tax`.
It might complicate things during the hackathon, tho, so it's totally OK to just leave things as they are and reference the `lca` functions from `tax`; this also gives you the flexibility to customize the `tax` functions where needed. we can move them over later, as long as everything is tested (similar to what we are doing with `sourmash compute` vs `sourmash sketch`).
> suggest converse - copy or move functions over to `tax`, and change `lca` to reference them (if unchanged) or not (if changed).
> To my understanding, `lca` will become a special case and/or be deprecated in the future, and I think all taxonomy stuff should be moved under `tax`.
wonderful. I didn't want to suggest this because I was worried about backwards compatibility, but ofc, can just reference the functions in `lca`!
> It might complicate things during the hackathon, tho, so it's totally OK to just leave things as they are and reference the `lca` functions from `tax`; this also gives you the flexibility to customize the `tax` functions where needed. we can move them over later, as long as everything is tested (similar to what we are doing with `sourmash compute` vs `sourmash sketch`).
good point. Will start with copying over the functions we use directly, to allow modification as needed during the hackathon.
ref https://github.com/dib-lab/charcoal/issues/174:
It's probably a good idea for gather_at_rank to detect and handle/report such ties, and probably pull the taxonomic assignment up to the level above the tie.
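A rough sketch of that tie handling, under stated assumptions: this is not the existing `gather_at_rank` code, the function names `sum_at_rank` and `assign_with_ties` are hypothetical, and matches are modelled as `(lineage tuple, fraction)` pairs with made-up values. When the top lineages at a rank have equal summed fractions, the tie is flagged and the assignment moves up one rank:

```python
import math
from collections import defaultdict

RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def sum_at_rank(matches, rank):
    """Sum match fractions per lineage, with lineages truncated at `rank`."""
    depth = RANKS.index(rank) + 1
    totals = defaultdict(float)
    for lineage, fraction in matches:
        if len(lineage) >= depth:
            totals[tuple(lineage[:depth])] += fraction
    return dict(totals)


def assign_with_ties(matches, rank):
    """Return (rank, lineage, fraction, tie_seen), climbing one rank per tie."""
    tie_seen = False
    for current in reversed(RANKS[: RANKS.index(rank) + 1]):
        totals = sum_at_rank(matches, current)
        if not totals:
            continue
        best = max(totals.values())
        winners = [lin for lin, total in totals.items() if math.isclose(total, best)]
        if len(winners) == 1:
            return current, winners[0], best, tie_seen
        tie_seen = True  # report the tie and try the rank above
    return None, (), 0.0, tie_seen


# Made-up example: two species in the same genus, each with fraction 0.25.
genus = ("Bacteria", "P", "C", "O", "F", "G")
tied = [(genus + ("G one",), 0.25), (genus + ("G two",), 0.25)]
result = assign_with_ties(tied, "species")
```

Here `result` reports the genus-level lineage with the combined fraction and `tie_seen=True`, so the caller can both use the higher-rank assignment and report that a tie occurred.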
> good point. Will start with copying over the functions we use directly, to allow modification as needed during the hackathon.
maybe: copy on write?
import from lca until you need to change, then when you need to change, copy.
e.g. the tree/LCA stuff is unlikely to need changes, but the taxonomy loading stuff is ...questionable :)
preserving from slack
bluegenes:feet: is there any need for a `sourmash tax label` (or similar), where we just add lineage information into the gather results, with no summarization at all?
titus:speech_balloon: I kinda like that! you could imagine something like describe or display that would give you something human readable, OR just have it be straight up CSV output
bluegenes:feet: ooh, definitely this is also making me think of some reporting folks might like out of classify — x% of genomes classified at species, etc
titus:speech_balloon: yep
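For the plain CSV variant of that idea, a minimal sketch. The `label_gather_rows` helper is hypothetical; it assumes the gather CSV has a `name` column whose first space-delimited token is the identifier used in the lineage lookup, and the rows below are made up:

```python
import csv
import io


def label_gather_rows(gather_handle, lineages, out_handle):
    """Copy gather rows unchanged, adding a `lineage` column from the lookup."""
    reader = csv.DictReader(gather_handle)
    fieldnames = list(reader.fieldnames or []) + ["lineage"]
    writer = csv.DictWriter(out_handle, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # Assumption: the identifier is the first space-delimited token of `name`.
        ident = row["name"].split(" ")[0]
        # Rows without a known lineage get an empty string, not an error.
        row["lineage"] = ";".join(lineages.get(ident, ()))
        writer.writerow(row)


# Made-up gather rows and lineage lookup.
gather = io.StringIO(
    "name,f_unique_weighted\n"
    "ACC1 some genome,0.5\n"
    "ACC9 unknown genome,0.1\n"
)
lookup = {"ACC1": ("d__Bacteria", "p__P")}
out = io.StringIO()
label_gather_rows(gather, lookup, out)
labeled = list(csv.DictReader(io.StringIO(out.getvalue())))
```

No summarization happens here: every input row comes back out with one extra column, which is also a natural place to count how many rows did or did not get a lineage.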
I think everything in here is covered by #1543! At this point someone(s) should revisit https://github.com/sourmash-bio/sourmash/issues/969 and create a new "summary" issue that contains the remaining ideas, but no urgency.
@luizirber and @bluegenes and I have been getting more excited about a `sourmash taxonomy` command, and potentially tackling pieces of it in a DIB lab hackathon. We had a conversation about this and wanted to summarize the main points here, as well as continue brainstorming.
Goal: command line interface that takes one or multiple `sourmash gather` csvs and a lineage csv and provides taxonomic rank summarization and downstream formatting for ingestion by popular taxonomy visualization tools.
Relevant Issues:
Relevant Repos:
- `assembly_stats.txt`. Alternative to `2018-ncbi-lineages` that parses the genbank `assembly_stats.txt` for the assembly accession and taxon id.
- `/home/irber/sourmash_databases/outputs/lca/lineages`
What needs to be included in `sourmash taxonomy`?
Command line interface, inputs and outputs: What should the command line interface look like? What should the format of the inputs and outputs be? What functionality should be included?
- `sourmash gather` csv files
- `superkingdom`)
- `convert`/`liftover` <- convert between taxonomies
- `summarize`; similar to `lca summarize`
- `cami` format output <- (from https://github.com/luizirber/2020-cami)
- `krona` format output
- `newick` format output?
Think about for the future