sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

translating between taxonomies - maybe a `sourmash tax translate`? #2201

Open ctb opened 2 years ago

ctb commented 2 years ago

there is some interest in translating between taxonomies (GTDB, NCBI, and maybe LINs), and this is something that we should be able to do fairly straightforwardly in sourmash.

relevant issues -

a few brainstorming notes and thoughts -

GTDB only provides taxonomy for bacteria and archaea, not euks or viruses; same for LINs.

here I'm mostly thinking about using sourmash to translate between taxonomic annotations that have been made elsewhere (with or without sourmash);

note that the genomes in GTDB are included within GenBank, so for ~300,000 genomes there is already a 1:1 mapping.

my basic idea is to build mapping tables for NCBI lineages into GTDB lineages, by using `sourmash gather` and `sourmash tax genome` on NCBI genomes, and then... publish them!
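a minimal sketch of what building such a mapping table could look like, assuming we already have two dictionaries keyed by genome identifier, one per taxonomy. The function names (`build_mapping`, `best_mapping`) and the toy lineages are hypothetical and are not part of sourmash; the "most common target lineage wins" rule is just one possible way to resolve disagreements.

```python
from collections import Counter, defaultdict


def build_mapping(from_tax, to_tax):
    """Count, for each source lineage, which target lineages its genomes carry.

    from_tax and to_tax are {genome_ident: lineage_string} dictionaries
    (hypothetical inputs; in practice they might come from tax spreadsheets).
    """
    votes = defaultdict(Counter)
    for ident, from_lineage in from_tax.items():
        to_lineage = to_tax.get(ident)
        if to_lineage is None:
            # genome has no annotation in the target taxonomy; skip it
            continue
        votes[from_lineage][to_lineage] += 1
    return votes


def best_mapping(votes):
    """Pick the most frequently observed target lineage for each source lineage."""
    return {src: counts.most_common(1)[0][0] for src, counts in votes.items()}


# toy, made-up lineages purely for illustration
ncbi = {
    "g1": "Bacteria;PhylumX;GenusA",
    "g2": "Bacteria;PhylumX;GenusA",
    "g3": "Bacteria;PhylumY;GenusB",
    "g4": "Bacteria;PhylumX;GenusA",
    "g5": "Bacteria;PhylumZ;GenusC",  # not present in the target taxonomy
}
gtdb = {
    "g1": "d__Bacteria;p__X1;g__A1",
    "g2": "d__Bacteria;p__X1;g__A1",
    "g3": "d__Bacteria;p__Y1;g__B1",
    "g4": "d__Bacteria;p__X1;g__A2",
}

table = best_mapping(build_mapping(ncbi, gtdb))
for src, dst in sorted(table.items()):
    print(src, "->", dst)
```

keeping the full vote counts around (rather than only the winner) would also make it easy to flag source lineages that split across several target lineages.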

this would need a new command, maybe `sourmash tax translate`, that would take two taxonomy spreadsheets (`--from-tax` and `--to-tax`, maybe?) in a variety of formats (those currently accepted, as well as biom, https://github.com/sourmash-bio/sourmash/issues/2199, and semicolon-separated, https://github.com/sourmash-bio/sourmash/issues/2185) and do the translation for you.
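to make the "variety of formats" point concrete, here is a rough sketch of a loader that normalizes two spreadsheet layouts into the same in-memory form. This is not existing sourmash code: `load_taxonomy`, the `ident` and `lineage` column names, and the rank list are all assumptions made for illustration.

```python
import csv
import io

# assumed rank columns for the one-column-per-rank layout
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def load_taxonomy(fileobj):
    """Return {ident: tuple_of_names} from a CSV file object.

    Accepts either a single semicolon-separated 'lineage' column or
    one column per rank (both layouts are assumptions of this sketch).
    """
    table = {}
    for row in csv.DictReader(fileobj):
        if row.get("lineage"):
            lineage = tuple(name.strip() for name in row["lineage"].split(";"))
        else:
            lineage = tuple((row.get(rank) or "").strip() for rank in RANKS)
        # drop trailing empty ranks so both layouts compare equal
        while lineage and not lineage[-1]:
            lineage = lineage[:-1]
        table[row["ident"]] = lineage
    return table


# the same toy record written in both layouts
by_rank = io.StringIO(
    "ident,superkingdom,phylum,class,order,family,genus,species\n"
    "g1,Bacteria,PhylumX,,,,,\n"
)
semicolon = io.StringIO(
    "ident,lineage\n"
    "g1,Bacteria; PhylumX\n"
)

tax_a = load_taxonomy(by_rank)
tax_b = load_taxonomy(semicolon)
print(tax_a == tax_b)
```

once both inputs are in one representation, the translation step itself only needs to look lineages up in a mapping table, independent of the on-disk format.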

in the fullness of time this could become a way for people with results in one taxonomy to map them to another; this should maybe be discouraged in situations where you have genomes (just use `sourmash gather` on those genomes!) but could be useful for people who are using other tax classification programs.

ctb commented 2 years ago

it would be good to identify places where translation does not round-trip (A->B->A)
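a small sketch of how such a round-trip check could work, assuming the two translation tables are plain dictionaries from lineage string to lineage string. `non_round_trip` and the toy lineages are hypothetical names for illustration only.

```python
def non_round_trip(a_to_b, b_to_a):
    """Return {a: (b, a_back)} for each A lineage where A->B->A does not return A.

    a_back is None when the B lineage has no entry in the reverse table.
    """
    failures = {}
    for a, b in a_to_b.items():
        a_back = b_to_a.get(b)
        if a_back != a:
            failures[a] = (b, a_back)
    return failures


# toy example: two A lineages collapse into one B lineage,
# so only one of them can survive the trip back
a_to_b = {
    "GenusA;species1": "g__A1;s__one",
    "GenusA;species2": "g__A1;s__one",
    "GenusB;species3": "g__B1;s__three",
}
b_to_a = {
    "g__A1;s__one": "GenusA;species1",
    "g__B1;s__three": "GenusB;species3",
}

problems = non_round_trip(a_to_b, b_to_a)
for a, (b, a_back) in problems.items():
    print(f"{a} -> {b} -> {a_back}")
```

lineages that merge in one direction (many-to-one) are the obvious source of round-trip failures, so reporting them alongside the mapping table seems useful.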

ctb commented 2 years ago

see discussion in "LINgroups as a Principled Approach to Compare and Integrate Multiple Bacterial Taxonomies" paper!