micom-dev / micom

Python package to study microbial communities using metabolic modeling.
https://micom-dev.github.io/micom
Apache License 2.0

Support for GTDB taxonomy? #36

Open nick-youngblut opened 3 years ago

nick-youngblut commented 3 years ago


Is your feature related to a problem? Please describe it.

The Genome Taxonomy Database (GTDB) is comprehensive (especially the new v202 release) and more robust than the NCBI microbial taxonomy, particularly because the GTDB taxonomy is based entirely on genome phylogenetic relatedness.

Although the MICOM docs are vague about the taxonomy that one must use, it appears that the NCBI taxonomy is required.

Describe the solution you would like.

Provide direct support for the GTDB taxonomy.

cdiener commented 3 years ago

MICOM doesn't really set any requirements for the taxonomy but you are right that you usually need the taxonomy of your data to match the taxonomy of the model database.

I also thought about providing the model databases with different taxonomies but haven't found a good way to map NCBI taxon IDs to GTDB ones. If you know of a way to do so, that would be great. Otherwise, we would have to get all the original genomes from the database and classify them, which would be pretty involved because it is not straightforward to get the genomes for the AGORA models, for instance.

nick-youngblut commented 3 years ago

> I also thought about providing the model databases with different taxonomies but haven't found a good way to map NCBI taxon IDs to GTDB ones

You could use or build on a simple script that I wrote to map the NCBI taxonomy to the GTDB taxonomy: ncbi-gtdb_map.py. It simply uses the metadata provided by the GTDB, which includes NCBI and GTDB taxonomies for each genome.

If you need to map at the taxid level, some of the other scripts in that repo might be useful.
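
For anyone reading along, the idea can be sketched roughly as follows. This is not the linked script, just a minimal illustration. It assumes the GTDB metadata table is tab-separated and has `ncbi_taxonomy` and `gtdb_taxonomy` columns holding `;`-separated lineages with rank prefixes such as `s__`. The taxa in the inline table are made-up placeholders, and the helper names `rank_name` and `ncbi_to_gtdb_counts` are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up stand-in for a GTDB metadata table (tab-separated).
# The column names are an assumption about the GTDB metadata layout.
METADATA = "\n".join([
    "accession\tncbi_taxonomy\tgtdb_taxonomy",
    "genome_1\td__Bacteria;g__GenusA;s__GenusA alpha\td__Bacteria;g__GenusA;s__GenusA alpha",
    "genome_2\td__Bacteria;g__GenusA;s__GenusA beta\td__Bacteria;g__GenusA_B;s__GenusA_B beta",
    "genome_3\td__Bacteria;g__GenusA;s__GenusA beta\td__Bacteria;g__GenusC;s__GenusC gamma",
])


def rank_name(lineage, prefix):
    """Return the name at one rank (e.g. 's__') of a ';'-separated lineage."""
    for part in lineage.split(";"):
        part = part.strip()
        if part.startswith(prefix):
            return part[len(prefix):]
    return ""


def ncbi_to_gtdb_counts(handle, prefix="s__"):
    """Count, per NCBI name, how many genomes carry each GTDB name."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(handle, delimiter="\t"):
        ncbi = rank_name(row["ncbi_taxonomy"], prefix)
        gtdb = rank_name(row["gtdb_taxonomy"], prefix)
        if ncbi and gtdb:  # skip genomes missing either assignment
            counts[ncbi][gtdb] += 1
    return counts


counts = ncbi_to_gtdb_counts(io.StringIO(METADATA))
for ncbi, hits in sorted(counts.items()):
    print(ncbi, "->", dict(hits))
```

Keeping the per-genome counts, rather than a single target name, makes it easy to see which NCBI names split across several GTDB names.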

cdiener commented 3 years ago

Oh cool, will try with that one.

cdiener commented 1 year ago

It's a bit embarrassing that this took so long; I had lumped it in with the general revamp of DB construction. You can now find GTDB databases on Zenodo: https://zenodo.org/record/7739096

For now, I removed taxa where a single species maps to several species/genera in GTDB, but I'm open to better suggestions.
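
To illustrate what that filter means in practice, here is a minimal sketch under stated assumptions. It is not the actual database construction code; the function name `unambiguous_map` and the example taxa are hypothetical.

```python
from collections import defaultdict


def unambiguous_map(pairs):
    """Keep only source names that map to exactly one GTDB name.

    `pairs` is an iterable of (source_name, gtdb_name) tuples, e.g. one
    tuple per genome. Source names with several distinct GTDB targets
    are dropped entirely.
    """
    targets = defaultdict(set)
    for source, gtdb in pairs:
        targets[source].add(gtdb)
    return {
        source: next(iter(names))
        for source, names in targets.items()
        if len(names) == 1
    }


# Made-up example: "GenusA beta" maps to two GTDB species, so it is removed.
pairs = [
    ("GenusA alpha", "GenusA alpha"),
    ("GenusA alpha", "GenusA alpha"),
    ("GenusA beta", "GenusA_B beta"),
    ("GenusA beta", "GenusC gamma"),
]
mapping = unambiguous_map(pairs)
print(mapping)
```

A softer alternative would be a majority vote across genomes, but that is a different technique from the strict removal described above.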

PathogeNish commented 10 months ago

Hi @cdiener, just to confirm: is the agora201_gtdb207_genus_1.qza file a genus-level aggregation of the AGORA2 (7,000+ strains) model database using GTDB nomenclature?

cdiener commented 10 months ago

Yes, that is correct, with the caveat mentioned above that I had to remove taxa that did not map cleanly to GTDB. The release page has links to the manifests of all included genera.

PathogeNish commented 10 months ago

Hi @cdiener, I downloaded the raw sequence (WMGS) data from the MICOM paper GitHub and ran classification using MetaPhlAn4. I then considered two cases:

  1. ChocoPhlAn taxonomy.
  2. Using the provided ChocoPhlAn-to-GTDB tool to convert to GTDB taxonomy.

In both cases, using the build function to build a community resulted in only ~20-30 samples with >80% coverage.

  1. The ChocoPhlAn taxonomy uses its own SGB nomenclature, which matches neither NCBI nor GTDB, so this is understandable.
  2. The converted GTDB taxonomies didn't show up consistently in the model db agora201_gtdb207_genus_1.qza.

This seems to indicate that the caveat you mentioned has a large effect, because few bacterial models appear to pass the filter and receive GTDB names.

What do you think is the best way to proceed?

cdiener commented 10 months ago

Hi @PathogeNish, hmm, there could be a bunch of things going on. Can you share the MetaPhlAn output table? Also, did you filter unclassified genera before you calculated the coverage? I suspect uSGBs can't be matched well. Another possibility is a GTDB version mismatch. Some phyla got renamed recently, so if you match in strict mode that could be an issue.
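
As a rough sketch of the check I mean: drop unclassified genera first, then compute the fraction of the remaining abundance that has a matching genus in the model database. This is only one plausible definition of coverage, not necessarily how MICOM computes it. The function name `coverage`, the default rule for what counts as unclassified, and the example genera are all hypothetical.

```python
def _default_unclassified(genus):
    """Treat empty names and the literal 'unclassified' as unclassified."""
    return not genus or genus.strip().lower() == "unclassified"


def coverage(abundances, db_genera, is_unclassified=_default_unclassified):
    """Fraction of classified abundance whose genus is in the model database.

    `abundances` maps genus name -> abundance for one sample.
    `db_genera` is the set of genus names present in the model database.
    """
    classified = {
        genus: value
        for genus, value in abundances.items()
        if not is_unclassified(genus)
    }
    total = sum(classified.values())
    if total == 0:
        return 0.0  # nothing classified, so nothing can be covered
    matched = sum(v for g, v in classified.items() if g in db_genera)
    return matched / total


# Made-up sample: 20 of 100 abundance units are unclassified.
sample = {"GenusA": 50, "GenusB": 30, "": 20}
db_genera = {"GenusA", "GenusC"}

print(coverage(sample, db_genera))
```

Matching on the genus name alone, as here, also sidesteps renamed higher ranks, since the phylum label never enters the comparison.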