motu-tool / mOTUs

motus - a tool for marker gene-based OTU (mOTU) profiling
GNU General Public License v3.0

Clarify ref-mOTUs taxonomic annotation #85

Closed AlessioMilanese closed 10 months ago

AlessioMilanese commented 2 years ago

It is not clear how the annotation of the ref-mOTUs is created.

AlessioMilanese commented 2 years ago

Ref-mOTUs are annotated using the NCBI annotation (from 8 January 2019) of the genomes within each ref-mOTU. This relates to mOTUs 2.5.1 and mOTUs 3.0.x.

Note that the ref-mOTUs annotation is derived directly from the specI annotation (which is based on 40 marker genes instead of the 10 marker genes used for mOTUs). Starting from the specI annotation, we additionally correct clusters where a large majority of the genomes (at least 80%) agree on one annotation. For example, a cluster composed of 100 genomes, where 99 genomes are annotated as Prevotella copri and 1 genome is annotated as Prevotella bivia, will be annotated as Prevotella copri. See also #72 and #76 for details.
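The 80% rule can be illustrated with a minimal sketch. This is not the mOTUs code: the function name `majority_annotation` and its `threshold` parameter are hypothetical, and the sketch assumes each genome carries a single annotation string.

```python
from collections import Counter


def majority_annotation(annotations, threshold=0.8):
    """Return the annotation shared by at least `threshold` of the genomes, else None.

    Hypothetical helper for illustration; not part of mOTUs.
    """
    if not annotations:
        return None
    # Most frequent annotation and how many genomes carry it.
    name, count = Counter(annotations).most_common(1)[0]
    return name if count / len(annotations) >= threshold else None


# The example from the comment above: 99 genomes agree, 1 disagrees.
cluster = ["Prevotella copri"] * 99 + ["Prevotella bivia"]
print(majority_annotation(cluster))  # Prevotella copri
```

A cluster without such a majority returns `None` here, meaning the specI annotation would be left as it is.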

Here is a summary of how the specI annotation was created:

Each specI is composed of one or more genomes. The annotation of specIs is done progressively:

  1. We find singleton specIs (i.e. specIs composed of only one genome). In this case we use the NCBI taxonomy of that genome.
  2. We find specIs composed of more than one genome, but where all genomes agree on one taxonomy. In this case, too, we use that shared NCBI taxonomy.
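The first two steps above reduce to a single check, sketched here under the assumption that each genome's NCBI taxonomy is available as one string. The function name `unanimous_annotation` is hypothetical.

```python
def unanimous_annotation(genome_taxonomies):
    """Return the shared taxonomy if all genomes agree, else None.

    A singleton specI trivially agrees with itself, so both of the
    first two steps are covered. Hypothetical helper, not mOTUs code.
    """
    distinct = set(genome_taxonomies)
    return next(iter(distinct)) if len(distinct) == 1 else None


print(unanimous_annotation(["Escherichia coli"]))  # Escherichia coli
print(unanimous_annotation(["Escherichia coli", "Escherichia albertii"]))  # None
```

A specI for which this returns `None` moves on to the later steps.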

[From the next step on, the NCBI annotations are inconsistent]

  3. For the remaining specIs, I consider only genomes with a proper binomial annotation. This means the name must have exactly two words (genus and species), and the second word cannot be "sp." (for example, "Escherichia coli" is a valid name, while "Escherichia sp." is not). There are four cases:
    • i. The remaining genomes agree on a single NCBI taxonomy, which we use to annotate the specI;
    • ii. There are two valid annotations. We report both. Example: we have Bacteroides dorei and Bacteroides vulgatus in the same cluster; it becomes Bacteroides dorei/vulgatus. Example for a taxonomic level higher than species (there are two class annotations): Proteobacteria class [Gammaproteobacteria/Alphaproteobacteria]
    • iii. There are at least three valid annotations. We then provide as annotation the higher taxonomic level plus "incertae sedis". Example with four family annotations: Thermoanaerobacterales fam. incertae sedis (where Thermoanaerobacterales is the order annotation on which the genomes agree)
    • iv. There is no annotation because there were no genomes with valid names.
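The binomial filter and the four cases above can be sketched as follows. All function names are hypothetical and this is not the mOTUs code. The sketch only classifies a specI into a case and formats the same-genus example for case ii, because the exact output strings for the other cases are not fully specified in this thread.

```python
def is_valid_binomial(name):
    """True for exactly two words where the second word is not 'sp.'."""
    words = name.split()
    return len(words) == 2 and words[1] != "sp."


def classify_specI(names):
    """Return (case, sorted distinct valid names) for cases i to iv."""
    valid = sorted({n for n in names if is_valid_binomial(n)})
    if len(valid) == 1:
        case = "i"    # one valid annotation: use it
    elif len(valid) == 2:
        case = "ii"   # two valid annotations: report both
    elif len(valid) >= 3:
        case = "iii"  # three or more: higher level plus "incertae sedis"
    else:
        case = "iv"   # no genome with a valid name
    return case, valid


def join_two_species(first, second):
    """Format two binomials of the same genus as 'Genus species1/species2'."""
    genus_a, species_a = first.split()
    genus_b, species_b = second.split()
    if genus_a != genus_b:
        # Assumption: the sketch covers only the same-genus example.
        raise ValueError("sketch only handles two species from the same genus")
    return f"{genus_a} {species_a}/{species_b}"


case, valid = classify_specI(
    ["Bacteroides dorei", "Bacteroides vulgatus", "Bacteroides sp."]
)
print(case, join_two_species(*valid))  # ii Bacteroides dorei/vulgatus
```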

[From the next step on, we have weird annotations like "Candidatus Yanofskybacteria bacterium RIFCSPLOWO2_02_FULL_45_10", with almost no annotation at higher taxonomic levels]

  4. For the remaining weird annotations, we try to resolve them. There are two cases:

    • i. The base of the annotation is the same. Example: two genomes, Candidatus Wildermuthbacteria bacterium RIFCSPHIGHO2_01_FULL_4922b and Candidatus Wildermuthbacteria bacterium RIFCSPHIGHO2_02_FULL_499, become Candidatus Wildermuthbacteria bacterium
    • ii. The base is different, and we take the higher-level annotation (which might go up to Kingdom level). Example: two genomes, Candidatus Nomurabacteria bacterium GW2011_GWB1_476 and Candidatus Yanofskybacteria bacterium RIFCSPLOWO2_02_FULL_45_10, are annotated as Bacteria sp.
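The two cases above can be sketched in a few lines. The sketch makes two assumptions: the "base" is everything before the last whitespace-separated token (the strain identifier), and the agreeing higher-level annotation is supplied by the caller. The function name `resolve_weird_annotations` and the `GENOME_*` suffixes are hypothetical placeholders, not real identifiers.

```python
def resolve_weird_annotations(names, higher_level):
    """Return the shared base of the names, or `higher_level` if the bases differ.

    Assumption: the strain identifier is the last whitespace-separated token.
    Hypothetical helper, not mOTUs code.
    """
    bases = {name.rsplit(" ", 1)[0] for name in names}
    return bases.pop() if len(bases) == 1 else higher_level


same_base = [
    "Candidatus Wildermuthbacteria bacterium GENOME_A",
    "Candidatus Wildermuthbacteria bacterium GENOME_B",
]
different_base = [
    "Candidatus Nomurabacteria bacterium GENOME_C",
    "Candidatus Yanofskybacteria bacterium GENOME_D",
]
print(resolve_weird_annotations(same_base, "Bacteria sp."))
# Candidatus Wildermuthbacteria bacterium
print(resolve_weird_annotations(different_base, "Bacteria sp."))
# Bacteria sp.
```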

Jibowe commented 2 years ago

Hi, regarding "iv. There is no annotation because there were no genomes with valid names": does this mean "unassigned"?

AlessioMilanese commented 2 years ago

Hi @Jibowe, the specIs in (iv) are the ones that go on to point 4. They are still ref-mOTUs; they do not go to unassigned. The unassigned fraction is explained here: https://github.com/motu-tool/mOTUs/wiki/Explain-the-resulting-profile#unassigned