sketch names in GTDB database use NCBI taxonomy

ccbaumler commented 9 months ago

I am working on this pangenome database idea at https://github.com/ctb/2022-database-covers/. I may have found an error in the taxonomic classification while trying to figure some stuff out with the scripts.

For example, from the gtdb-rs214.lineages.csv:

GCA_000398885.1,f,d__Bacteria,p__Pseudomonadota,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia ruysiae

When I extract that signature from the GTDB db:

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /home/baumlerc/2022-database-covers/ruysiae-or-coli.zip
signature: GCA_000398885.1 Escherichia coli KTE33 strain=KTE33, Esch_coli_KTE33_V1
source file: /dev/fd/63
md5: 5fa7a957bf01b4a8bfbc2b3265e79d34
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4489
sum hashes: 4489
signature license: CC0

loaded 1 signatures total, from 1 files

The GenBank link implies that this should be considered E. coli https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000398885.1/

bluegenes commented 9 months ago

huh. Thanks for finding, @ccbaumler! We used make-gtdb-taxonomy.py file, which you can see here: https://github.com/sourmash-bio/database-releases/pull/3/files#diff-f2b14d2992941e5da6adff7c90f2d2ed91b1cca96a5a84ee8ca0e6e92feb03d7

https://github.com/sourmash-bio/database-releases/pull/3

ctb commented 9 months ago

they probably changed it. they do that.

ccbaumler commented 9 months ago

I can rerun the script to create the lineage spreadsheet, but I don't see how a GenBank update could affect the GTDB and lineage file. We created those at the same time and they should be accurate to each other. The only way GenBank update could affect this is if we recreated one file and not the other.

ctb commented 9 months ago

wait I don't understand. apologies if I'm missing something obvious 😓

GTDB is distinct from GenBank taxonomy. They use the same genome accessions, but the taxonomies are intentionally not concordant.

when you say "the genbank link implies..." that is something that genbank changes from time to time.

ctb commented 9 months ago

here, the genbank link matches what's in the GTDB zipfile, which is an assembly record named based on the Genbank classification.

the GTDB folk have decided it's a different Escherichia, and the taxonomy file for GTDB assigns that GenBank access to s__Escherichia ruysiae.

Probably the most confusing thing I see here is that we did not actually rename the sketch in the zip file to match the GTDB classification, but kept it as the GenBank name... computers are hard 😭

ccbaumler commented 9 months ago

Thank you for walking through everything! I see my own misunderstanding clearly now.

Would it be best to update the signature name with the species rank name from the lineage spreadsheet and a unique identifier?

"Computers are hard." - Titus

ctb commented 9 months ago

Would it be best to update the signature name with the species rank name from the lineage spreadsheet and a unique identifier?

sounds like work

ccbaumler commented 9 months ago

It's this kind of wisdom that keeps me showing up day after day.

ctb commented 9 months ago

so, some slightly more sober thoughts ;).

I agree that it's confusing that when you find matches with search/gather in a database named 'gtdb', you get the NCBI genome names, which basically use taxonomic names from NCBI.

The options to fix this would appear to be, roughly -

rename the sketches in the gtdb databases to match the GTDB taxonomy. This is certainly do-able, but breaks the principle that we're sketching the genomes/using the genome names as they are given to us by the source - which is NCBI. It might make tracking down the source of genome sketches just one step harder, too. Here it helps that would leave the accessions the same, since GTDB uses the same accession information as GenBank.
remove the names altogether and just go with accession. This would be annoying to me (and presumably other users) because I like seeing slightly more informative output in sourmash gather and so on.
leave it as it is - no work needs to be done. Certainly the easiest thing to do :).

I'm tempted to leave this issue open (maybe changing the issue title), or maybe create a new one, and then see what we feel like doing the next time GTDB does a release and we need to update.

I agree it is mildly confusing as it is, but we've been doing this for years and you're the first person who has noticed - kudos on paying attention 😆 ! - so maybe it's a minor confusion in a tapestry of many larger confusions? :). As it is, GTDB mostly agrees with NCBI taxonomy at the family/genus level, too, so it's not that jarring a disconnect in practice.

ccbaumler commented 9 months ago

I think if we add more info into the signature name and discuss it in the documentation, it would alleviate any confusion. It could also show in the output that your return tax has some contradictions to investigate as well.

Last night I was looking through the script @bluegenes linked and the metadata files that were used to create the lineage files. I think we could update the sig names throughout the database to include the gtdb_genome_representative accession with the gtdb_taxonomy species name. Something like:

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /home/baumlerc/2022-database-covers/ruysiae-or-coli.zip

signature: GCA_000398885.1 Escherichia coli KTE33 strain=KTE33, Esch_coli_KTE33_V1 (GTDB_GCA_021307345.1 GTDB_Escherichia ruysiae)

source file: /dev/fd/63
md5: 5fa7a957bf01b4a8bfbc2b3265e79d34
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4489
sum hashes: 4489
signature license: CC0

loaded 1 signatures total, from 1 files

Appendix

First few columns of `bac120_metadata_r214.tsv`:	accession	ambiguous_bases	checkm_completeness	checkm_contamination	checkm_marker_count	checkm_marker_lineage	checkm_marker_set_count	checkm_strain_heterogeneity	coding_bases	coding_density	contig_count	gc_count	gc_percentage	genome_size	gtdb_genome_representative	gtdb_representative	gtdb_taxonomy
GB_GCA_000398885.1	0	98.61	0.04	1173	f__Enterobacteriaceae (UID5124)	336	0	3884601	84.7628338874941	220	2275196	50.6641441436627	4582906	GB_GCA_021307345.1	f	dBacteria;pPseudomonadota;cGammaproteobacteria;oEnterobacterales;fEnterobacteriaceae;gEscherichia;s__Escherichia ruysiae

The two columns in `bac120_taxonomy_r214.tsv`:
GB_GCA_021307345.1	dBacteria;pPseudomonadota;cGammaproteobacteria;oEnterobacterales;fEnterobacteriaceae;gEscherichia;s__Escherichia ruysiae

luizirber commented 9 months ago

Back when I added NCBI genome calculation to wort I started building the name for the sig using these fields from assembly_summary_genbank.txt. At that point in time the only place we had to save this metadata was inside the signature, and it was this name that was pulled when reporting on gather (and was the solution for this item from Titus list:

remove the names altogether and just go with accession. This would be annoying to me (and presumably other users) because I like seeing slightly more informative output in sourmash gather and so on.

There is a whole discussion to be had on splitting "fast-changing" metadata (taxonomy, names) from "slow-changing" metadata/data (accession ID, the actual hashes in the minhash), and "name" is one of these fields that fall more in "fast" than "slow". But it is sort of the only one we have at the signature level.

But, turns out we have a better solution for collections nowadays: manifests!

We can add an extra column to the manifest for "preferred"/"overwrite" name, and use that (if present) when outputting in gather/search/any other place reporting names. Manifest are easy to change/update, don't mess with the signatures (so we can reuse NCBI sigs for GTDB, for example), and fit nicely into the "fast"/"slow" divide above.

ctb commented 9 months ago

We can add an extra column to the manifest for "preferred"/"overwrite" name, and use that (if present) when outputting in gather/search/any other place reporting names. Manifest are easy to change/update, don't mess with the signatures (so we can reuse NCBI sigs for GTDB, for example), and fit nicely into the "fast"/"slow" divide above.

👍

ccbaumler commented 9 months ago

Thanks for all the great explanations, ideas and details!

One point I'm having trouble rectifying in my mind is the making signatures and taxonomies from GTDB data but naming the signature as NCBI taxonomic name. Intuitively, I would expect the sig "name" to be the GTDB taxonomic name.

ctb commented 9 months ago

Separation of name and taxonomy! A sourmash superpower (in many ways) is that we can key multiple different taxonomies on the same accession. But the flip side is that the name (which contains the accession plus some human-readable text) is distinct from taxonomy.

(IMO the content is what matters mostly, anyway. Names are ephemeral. Taxonomies change. Content is (closer to) eternal!

ccbaumler commented 9 months ago

Truth, the representative genomic signature is the most important part of the database. While names are ephemeral, we are still using them. My understanding is we are using them in contradictory ways by making the database name and the signature names from two different sources.

And, for taxonomic profiling, would it be more intuitive for users reading their gather output for the database to be called NCBI instead of GTDB because the signature names derive from the NCBI and not the GTDB?

Then when using alternate taxonomic naming databases, they would expect to see some opposing taxonomies... ?

ctb commented 9 months ago

And, for taxonomic profiling, would it be more intuitive for users reading their gather output for the database to be called NCBI instead of GTDB because the signature names derive from the NCBI and not the GTDB?

but 'gather' doesn't produce taxonomic output... tax metagenome does.

What we'd really want to call the database is something like gtdb-subset-of-ncbi. Which feels likely to be even more confusing.

ccbaumler commented 9 months ago

Please correct me if I misunderstood your comment about liking to see the output. I was referring to the fifth column of gather which returns the signature names of the database against the query.

If I use the GTDB sourmash database but it is returning NCBI taxonomies in that output... I don't know. Doesn't seem to be presenting the output intuitively in my mind. Especially when the output could differ from GTDB like the initial comment on this thread points out.

ctb commented 9 months ago

yes, I understand.

someone needs to propose a specific solution, and then someone needs to do the work. The person doing the work generally gets a large say in which solution is chosen :).

ctb commented 8 months ago

We can add an extra column to the manifest for "preferred"/"overwrite" name, and use that (if present) when outputting in gather/search/any other place reporting names. Manifest are easy to change/update, don't mess with the signatures (so we can reuse NCBI sigs for GTDB, for example), and fit nicely into the "fast"/"slow" divide above.

After talking through this with @ccbaumler a bit, I am endorsing this solution a bit more strongly -

renaming things is something we want to be able to do! adding an override name column to manifests strikes me as a good minimal change that enables that easily, and is also debuggable (as opposed to simply updating the name in a manifest and using that as a default override, which ...is not very explicit and hence not very debuggable)
if we do that, we should also consider adding a 'deprecated' boolean column per https://github.com/sourmash-bio/sourmash/issues/985#issuecomment-1065907568

ctb commented 8 months ago

if we're going to upgrade manifest contents to include deprecated/removed and preferred_name, we could also:

think about deprecating filename in manifests, 'cause it's dumb
add missing fields like seed and license per https://github.com/sourmash-bio/sourmash/issues/1849
maybe add a notes field

this would (in my opinionated opinion) be distinct from adding other metadata fields, tags, and so on, as explored in my comments on https://github.com/sourmash-bio/branchwater/issues/18

sourmash-bio / sourmash

sketch names in GTDB database use NCBI taxonomy #3006

Appendix