motu-tool / mOTUs

motus - a tool for marker gene-based OTU (mOTU) profiling
GNU General Public License v3.0

Taxonomic annotation of mOTUs in v2.5 #45

Closed theavanrossum closed 3 years ago

theavanrossum commented 4 years ago

Thanks for the great tool!

I've noticed that a mOTUs cluster that had a species name in v2.0 now has a domain level name in v2.5 : Enterococcus faecalis [ref_mOTU_v2_0116] in motus2.0 has become Bacteria sp. [ref_mOTU_v25_00318] in motus2.5.

In discussion, Alessio explained that this is due to the big increase in the number of references in v2.5. Because of this increase, the cluster now contains genomes whose NCBI taxon annotations fall in different phyla, which makes the lowest common ancestor (LCA) for the cluster Bacteria. This has happened to a few species, including E. coli.
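To make the mechanism concrete, here is a minimal sketch of a lowest-common-ancestor calculation over lineages. It is not the mOTUs implementation; the rank list, the function name, and the hand-written lineages are assumptions for illustration only.

```python
# Illustrative sketch only: not the code mOTUs uses to label clusters.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def lowest_common_ancestor(lineages):
    """Return (rank, name) of the deepest rank shared by all lineages, or None."""
    lca = None
    for depth, rank in enumerate(RANKS):
        names = {lineage[depth] for lineage in lineages}
        if len(names) != 1:
            break  # the lineages diverge at this rank
        lca = (rank, names.pop())
    return lca


# Hand-written example lineages (assumed input format: one name per rank).
cluster = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Enterobacter", "Enterobacter cloacae"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Escherichia", "Escherichia coli"],
    ["Bacteria", "Bacteroidetes", "Sphingobacteriia", "Sphingobacteriales",
     "Sphingobacteriaceae", "Pedobacter", "Pedobacter himalayensis"],
]

print(lowest_common_ancestor(cluster[:2]))  # the two Enterobacteriaceae genomes
print(lowest_common_ancestor(cluster))      # all three genomes
```

A single genome annotated in another phylum is enough to push the label for the whole cluster up to Bacteria, regardless of how many genomes agree at a lower rank.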

The new taxonomic annotations are correct and make sense technically, but they might be misinterpreted, and they make the results harder to put in the context of previous studies (i.e. decades of studies using NCBI taxonomy).

To me there are three cases:

  1. no good taxonomic annotations exist for members of the specI cluster
  2. good annotations exist but NCBI taxonomy conflicts with specI clustering (e.g. half cluster is from genus 1, half is from genus 2)
  3. good annotations exist and NCBI taxonomy mostly agrees with specI clustering except for 1 or 2 exceptions (cluster is ‘contaminated’ with poorly annotated genomes)

In case 1, I would expect to see “Bacteria sp.”

A complex solution would try to distinguish cases 2 & 3, but this isn’t simple so it’s understandable (though not ideal :) ) to leave it to the user. For me, the important thing is to distinguish “we have no idea what this species is” from “we have a pretty good idea, but it’s complicated” (i.e. distinguishing case 1 vs 2|3)
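As a rough illustration of how the three cases could be told apart automatically, here is a sketch that works on per-genome genus annotations. The function name, the use of `None` for "no usable annotation", and the 0.9 dominance cut-off are all assumptions, not anything mOTUs does.

```python
from collections import Counter


def triage_cluster(genera, dominance=0.9):
    """Classify a cluster from per-genome genus names (None = no usable annotation).

    The dominance threshold is a hypothetical cut-off between case 2 and case 3.
    """
    known = [g for g in genera if g is not None]
    if not known:
        return "case 1: no usable annotation"
    top_genus, top_count = Counter(known).most_common(1)[0]
    if top_count == len(known):
        return f"consistent: {top_genus}"
    if top_count / len(known) >= dominance:
        return f"case 3: mostly {top_genus}, few exceptions"
    return "case 2: annotations conflict with clustering"


print(triage_cluster([None, None, None]))
print(triage_cluster(["Enterobacter"] * 19 + ["Pedobacter"]))
print(triage_cluster(["GenusA"] * 5 + ["GenusB"] * 5))
```

Even this crude split would be enough to separate “we have no idea” (case 1) from “it’s complicated” (cases 2 and 3) in the label.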

To me, “Bacteria sp.” immediately says “completely unknown bacteria”, so I think it might be better to avoid this. (Even though using it makes sense when you think about the method.)

Taking another example: Bacteria sp. [ref_mOTU_v25_00077] is composed of many Enterobacter sp. and also includes E. coli:

NA Bacteria phylum 
Proteobacteria
Bacteroidetes

NA Bacteria class 
Sphingobacteriia
Gammaproteobacteria 

NA Bacteria order 
Sphingobacteriales
Enterobacterales

NA Bacteria fam. 
Erwiniaceae
Enterobacteriaceae
Sphingobacteriaceae

NA Bacteria gen. 
Lelliottia
Pantoea
Enterobacter
Escherichia
Klebsiella
Leclercia
Pedobacter

NA Bacteria sp. 
Enterobacter sp. SENG-6
Enterobacter sp. MGH 1
Enterobacter sp. MGH 3
Enterobacter sp. MGH 6
Enterobacter sp. MGH 7
Enterobacter sp. MGH 10
Enterobacter sp. MGH 14
Enterobacter sp. MGH 15
Enterobacter sp. MGH 22
Enterobacter sp. MGH 23
Enterobacter sp. MGH 24
Enterobacter sp. MGH 25
Enterobacter sp. MGH 33
Enterobacter sp. MGH 37
Enterobacter sp. MGH 38
Enterobacter sp. BWH 37
Enterobacter sp. BIDMC 26
Enterobacter sp. BIDMC 27
Enterobacter sp. BIDMC 28
Enterobacter sp. BIDMC 30
Enterobacter sp. EGD-HP1
Enterobacter sp. DC3
Enterobacter sp. DC4
Enterobacter sp. T1-1
Enterobacter sp. UCD-UG_FMILLET
Enterobacter sp. E20
Enterobacter sp. NFIX58
Enterobacter sp. NFIX45
Enterobacter sp. NFIX59
Enterobacter sp. 940_PEND
Enterobacter sp. HMSC16D10
Enterobacter hormaechei
Enterobacter sp. BIDMC92
Leclercia sp. LK8
Enterobacter sp. BWH52
Enterobacter sp. BWH63
Enterobacter sp. BWH64
Enterobacter sp. MGH119
Enterobacter sp. MGH120
Enterobacter sp. MGH128
Enterobacter sp. BIDMC87
Enterobacter sp. BIDMC93
Enterobacter sp. BIDMC94
Enterobacter sp. BIDMC99
Enterobacter sp. BIDMC100
Enterobacter sp. BIDMC109
Enterobacter sp. 50588862
Enterobacter sp. 50793107
Enterobacter sp. 50858885
Enterobacter sp. K66-74
Enterobacter roggenkampii
Enterobacter sp. ODB01
Enterobacter sp. IF2SW-P2
Enterobacter sp. HK169
Enterobacter sp. PDC34
Pantoea sesami
Enterobacter sp. ku-bf2
Enterobacter sp. 56-7
Enterobacter sp. ST121:950178628
Enterobacter sp. J49
Enterobacter sichuanensis
Enterobacter kobei
Enterobacter genomosp. O
Enterobacter genomosp. S
Pedobacter himalayensis
Enterobacter chengduensis
Enterobacter ludwigii
Enterobacter sp. DC1
Klebsiella aerogenes
Enterobacter cloacae
Escherichia coli
Klebsiella oxytoca
Enterobacter asburiae
Lelliottia amnigena
Enterobacter cancerogenus
Lelliottia nimipressuralis
Leclercia adecarboxylata
Enterobacter bugandensis

Bacteria sp. is the correct LCA, but it might not be the most useful label for this cluster. In this case, perhaps something like:

  1. mix Proteobacteria/Bacteroidetes [ref_mOTU_v25_00077], with the “mix” prefix leading the user to look at the taxonomy table more closely
  2. mixed NCBI taxa in Bacteria [ref_mOTU_v25_00077]
  3. mixed - mostly Enterobacter sp. [...]
  4. mixed - similar to E. coli […]

or ...?

The last option there would require manual curation of cluster taxonomic classifications, which comes with its own problems of course. However, as a user, if I see a lot of “Bacteria sp.” (especially for common taxa) it makes it very hard to put new results into the context of decades of work that have used the NCBI taxonomy.

The way you have it now (i.e. “LCA sp.”) makes sense and is correct; it’s just a bit hard to use in practice, so I thought I would mention my experience. Anyway, these are just my two cents as a user.

AlessioMilanese commented 4 years ago

Hi Thea,

Thanks for bringing this up. I think it makes sense; for example, I have been asked a couple of times why E. coli is not in users' profiles.

As a first note, you can use -u to get the names of all the species that are in a ref-mOTU cluster. This doesn't really solve the problem, but it would help you get the information that you added in your previous comment.

AlessioMilanese commented 4 years ago

To me there are three cases:

  1. no good taxonomic annotations exist for members of the specI cluster
  2. good annotations exist but NCBI taxonomy conflicts with specI clustering (e.g. half cluster is from genus 1, half is from genus 2)
  3. good annotations exist and NCBI taxonomy mostly agrees with specI clustering except for 1 or 2 exceptions (cluster is ‘contaminated’ with poorly annotated genomes)

... For me, the important thing is to distinguish “we have no idea what this species is” from “we have a pretty good idea, but it’s complicated” (i.e. distinguishing case 1 vs 2|3)

To have an idea of the number of ref-mOTUs within case 1 vs 2|3 you can check the following graph:

[Figure: incongruences]

Case 1 would belong to the gray "Unnamed clade", while case 2|3 belong to yellow "Merge" and red "Merge and split".

Note that since the human gut is highly studied, it has more representative reference genomes, and hence there is a higher chance that human gut ref-mOTUs are among the inconsistent mOTUs.

cmfield commented 4 years ago

Just a comment on E. coli: when I took all the associated NCBI tax_ids for the species and cross-referenced where the genomes with those tax_ids were in Freeze 12, I found they mapped to 4 different SpecI clusters. SpecI_95 is the primary E. coli cluster but I suspect that some other genomes are incorrectly labelled in NCBI and end up in the other clusters.
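The cross-referencing described here can be sketched roughly as follows. The record layout, genome names, and every cluster name other than SpecI_95 are placeholders; 562 is the NCBI tax_id for Escherichia coli, and 999999 is a made-up id for an unrelated genome.

```python
from collections import Counter


def clusters_for_taxids(genomes, tax_ids):
    """Count genomes per cluster among genomes whose tax_id is in tax_ids.

    `genomes` is an assumed list of (genome_id, tax_id, cluster) tuples.
    """
    wanted = set(tax_ids)
    return Counter(cluster for _, tax_id, cluster in genomes if tax_id in wanted)


# Placeholder records, not real Freeze 12 data.
genomes = [
    ("genome_a", 562, "SpecI_95"),
    ("genome_b", 562, "SpecI_95"),
    ("genome_c", 562, "SpecI_placeholder_1"),
    ("genome_d", 999999, "SpecI_placeholder_2"),
]

counts = clusters_for_taxids(genomes, [562])
for cluster, n in counts.most_common():
    print(cluster, n)
```

Clusters that hold only a small share of the genomes for a tax_id are the natural candidates for mislabelled entries.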

AlessioMilanese commented 4 years ago

One solution can be to identify the most abundant ref-mOTUs in human gut and manually curate these.

unode commented 4 years ago

Is this something that could be improved by complementing the current annotation with an NCBI taxonomy alternative? I'm thinking specifically about https://gtdb.ecogenomic.org/ (and https://github.com/Ecogenomics/GtdbTk) that tried to consolidate taxonomy and genetic distance, but wonder if there are similar efforts out there.

As I understand it, the main source of inconsistency is the fact that SpecI are defined based on genetic distance and NCBI taxonomy doesn't necessarily follow the same principle. This might also help address the issues mentioned by @cmfield .

AlessioMilanese commented 4 years ago

There are more inconsistencies in mOTUs 2.5 than in mOTUs 2.0 because there are more genomes within each mOTU. Some of the new genomes have a wrong annotation, and hence two corresponding mOTUs (from 2.0 and 2.5) might have different annotations.

Number of merge inconsistencies (like Rhodobacteraceae genus [Rhodobacter/Paenirhodobacter]):

| Level | Number of inconsistent mOTUs | Percentage |
| --- | --- | --- |
| Kingdom | 1 | 0.008 |
| Phylum | 22 | 0.18 |
| Class | 47 | 0.38 |
| Order | 65 | 0.53 |
| Family | 133 | 1.09 |
| Genus | 272 | 2.22 |
| Species | 702 | 5.74 |

I propose two possible solutions:

  1. Check the 272 ref-mOTUs that are inconsistent at genus (and higher) level. Maybe not fully manually: we can check whether 90% of the genomes agree on one taxonomy and, if so, select that one, then manually check the remaining ref-mOTUs. A problem with this is that over-studied species would take over the taxonomy of that mOTU.
  2. Find the 100 most abundant mOTUs in the human gut and check the inconsistencies manually.

AlessioMilanese commented 4 years ago

The numbers in the previous comment are an overestimate. Instead of 272 genus-level inconsistencies, there are 91. Here is a link to the annotation of the problematic ref-mOTUs (/specI): https://docs.google.com/spreadsheets/d/1vMBFckMLDLYipW-1Jzw2jYMMfkOl5Bkcp3G65gsfOug/edit?usp=sharing

AlessioMilanese commented 3 years ago

For mOTUs 2.6.1 we changed the taxonomy of 32 ref-mOTUs:

ref_mOTU_v25_00077
ref_mOTU_v25_00084
ref_mOTU_v25_00085
ref_mOTU_v25_00087
ref_mOTU_v25_00096
ref_mOTU_v25_00103
ref_mOTU_v25_00133
ref_mOTU_v25_00188
ref_mOTU_v25_00259
ref_mOTU_v25_00261
ref_mOTU_v25_00278
ref_mOTU_v25_00281
ref_mOTU_v25_00318
ref_mOTU_v25_00321
ref_mOTU_v25_00329
ref_mOTU_v25_00344
ref_mOTU_v25_00353
ref_mOTU_v25_00547
ref_mOTU_v25_00611
ref_mOTU_v25_00618
ref_mOTU_v25_00709
ref_mOTU_v25_00830
ref_mOTU_v25_00964
ref_mOTU_v25_01135
ref_mOTU_v25_01452
ref_mOTU_v25_01541
ref_mOTU_v25_01786
ref_mOTU_v25_01897
ref_mOTU_v25_02104
ref_mOTU_v25_02590
ref_mOTU_v25_02801
ref_mOTU_v25_00095

AlessioMilanese commented 3 years ago

Diff between the taxonomies:

79c79
< Bacteria sp. [ref_mOTU_v25_00077]
---
> Enterobacter sp. [ref_mOTU_v25_00077]
86,87c86,87
< Enterobacteriaceae sp. [ref_mOTU_v25_00084]
< Enterobacterales sp. [ref_mOTU_v25_00085]
---
> Klebsiella aerogenes [ref_mOTU_v25_00084]
> Klebsiella pneumoniae [ref_mOTU_v25_00085]
89c89
< Enterobacteriaceae sp. [ref_mOTU_v25_00087]
---
> Raoultella sp. [ref_mOTU_v25_00087]
97,98c97,98
< Proteobacteria sp. [ref_mOTU_v25_00095]
< Enterobacteriaceae sp. [ref_mOTU_v25_00096]
---
> Escherichia coli [ref_mOTU_v25_00095]
> Citrobacter sp. [ref_mOTU_v25_00096]
105c105
< Gammaproteobacteria sp. [ref_mOTU_v25_00103]
---
> Citrobacter sp. [ref_mOTU_v25_00103]
135c135
< Proteobacteria sp. [ref_mOTU_v25_00133]
---
> Pseudomonas sp. [ref_mOTU_v25_00133]
189c189
< Gammaproteobacteria sp. [ref_mOTU_v25_00188]
---
> Pseudomonas sp. [ref_mOTU_v25_00188]
260c260
< Bacteria sp. [ref_mOTU_v25_00259]
---
> Acinetobacter baumannii [ref_mOTU_v25_00259]
262c262
< Gammaproteobacteria sp. [ref_mOTU_v25_00261]
---
> Acinetobacter pittii [ref_mOTU_v25_00261]
279c279
< Bacteria sp. [ref_mOTU_v25_00278]
---
> Bacillus subtilis [ref_mOTU_v25_00278]
282c282
< Bacillales sp. [ref_mOTU_v25_00281]
---
> Bacillus sp. [ref_mOTU_v25_00281]
319c319
< Bacteria sp. [ref_mOTU_v25_00318]
---
> Enterococcus faecalis [ref_mOTU_v25_00318]
322c322
< Bacteria sp. [ref_mOTU_v25_00321]
---
> Enterococcus faecium [ref_mOTU_v25_00321]
330c330
< Bacilli sp. [ref_mOTU_v25_00329]
---
> Bacillus sp. [ref_mOTU_v25_00329]
345c345
< Bacilli sp. [ref_mOTU_v25_00344]
---
> Staphylococcus sp. [ref_mOTU_v25_00344]
353c353
< Mycobacteriaceae sp. [ref_mOTU_v25_00353]
---
> Mycobacteroides abscessus [ref_mOTU_v25_00353]
545c545
< Alcaligenaceae sp. [ref_mOTU_v25_00547]
---
> Achromobacter sp. [ref_mOTU_v25_00547]
608c608
< Streptomycetaceae sp. [ref_mOTU_v25_00611]
---
> Streptomyces sp. [ref_mOTU_v25_00611]
615c615
< Actinobacteria sp. [ref_mOTU_v25_00618]
---
> Streptomyces sp. [ref_mOTU_v25_00618]
706c706
< Hafniaceae sp. [ref_mOTU_v25_00709]
---
> Hafnia alvei [ref_mOTU_v25_00709]
827c827
< Enterobacterales sp. [ref_mOTU_v25_00830]
---
> Serratia marcescens [ref_mOTU_v25_00830]
960c960
< Bacteria sp. [ref_mOTU_v25_00964]
---
> Micrococcus sp. [ref_mOTU_v25_00964]
1129c1129
< Bacteria sp. [ref_mOTU_v25_01135]
---
> Methylobacterium sp. [ref_mOTU_v25_01135]
1442c1442
< Streptomycetaceae sp. [ref_mOTU_v25_01452]
---
> Streptomyces sp. [ref_mOTU_v25_01452]
1530c1530
< Alcaligenaceae sp. [ref_mOTU_v25_01541]
---
> Alcaligenes faecalis [ref_mOTU_v25_01541]
1775c1775
< Stenotrophomonas sp. [ref_mOTU_v25_01786]
---
> Stenotrophomonas maltophilia [ref_mOTU_v25_01786]
1885c1885
< Streptomycetaceae sp. [ref_mOTU_v25_01897]
---
> Streptomyces sp. [ref_mOTU_v25_01897]
2092c2092
< Bacteria sp. [ref_mOTU_v25_02104]
---
> Aerococcus sp. [ref_mOTU_v25_02104]
2573c2573
< Bacillaceae sp. [ref_mOTU_v25_02590]
---
> Parageobacillus thermoglucosidasius [ref_mOTU_v25_02590]
2781c2781
< Bacilli sp. [ref_mOTU_v25_02801]
---
> Staphylococcus warneri [ref_mOTU_v25_02801]
AlessioMilanese commented 3 years ago

If a ref-mOTU contains at least 10 genomes and at least 80% of those agree, then we select the taxonomy annotation of the agreeing 80%.

This change allows E. coli to appear in the profiles: Proteobacteria sp. [ref_mOTU_v25_00095] becomes Escherichia coli [ref_mOTU_v25_00095]
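For readers who want to see the rule spelled out, here is a minimal sketch of it. The function name and input format are hypothetical, and it assumes agreement is measured on the full taxonomy string of each genome; the thread does not say how or at which rank the actual implementation compares annotations.

```python
from collections import Counter


def consensus_taxonomy(annotations, lca_label, min_genomes=10, min_agreement=0.8):
    """Return the majority annotation if the rule applies, else the LCA label.

    `annotations` holds one taxonomy string per genome in the ref-mOTU
    (assumed format). The defaults mirror the 10-genome / 80% rule above.
    """
    if len(annotations) >= min_genomes:
        name, count = Counter(annotations).most_common(1)[0]
        if count / len(annotations) >= min_agreement:
            return name
    return lca_label


# Made-up cluster: 9 of 10 genomes agree, so the majority annotation wins.
genomes = ["Escherichia coli"] * 9 + ["Pedobacter himalayensis"]
print(consensus_taxonomy(genomes, "Proteobacteria sp."))

# Fewer than 10 genomes: the LCA label is kept even with strong agreement.
print(consensus_taxonomy(genomes[1:], "Proteobacteria sp."))
```

Clusters with fewer than 10 genomes, or with a majority below 80%, keep their LCA-based label under this reading of the rule.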