Closed theavanrossum closed 3 years ago
Hi Thea,
Thanks for bringing this up. I think it makes sense; for example, I have been asked a couple of times why E. coli is not in the profiles.
As a first note, you can use -u to get the names of all the species that are in a ref-mOTU cluster. This doesn't really solve the problem, but it would help to get the information that you added in your previous comment.
To me there are three cases:
- no good taxonomic annotations exist for members of the specI cluster
- good annotations exist but NCBI taxonomy conflicts with specI clustering (e.g. half cluster is from genus 1, half is from genus 2)
- good annotations exist and NCBI taxonomy mostly agrees with specI clustering except for 1 or 2 exceptions (cluster is ‘contaminated’ with poorly annotated genomes)
... For me, the important thing is to distinguish “we have no idea what this species is” from “we have a pretty good idea, but it’s complicated” (i.e. distinguishing case 1 vs 2|3)
To have an idea of the number of ref-mOTUs within case 1 vs 2|3 you can check the following graph:
Case 1 would belong to the gray "Unnamed clade", while case 2|3 belong to yellow "Merge" and red "Merge and split".
Note that since the human gut is highly studied, it has more representative reference genomes, and hence there is a higher chance that human gut ref-mOTUs end up among the inconsistent mOTUs.
Just a comment on E. coli: when I took all the associated NCBI tax_ids for the species and cross-referenced where the genomes with those tax_ids were in Freeze 12, I found they mapped to 4 different SpecI clusters. SpecI_95 is the primary E. coli cluster but I suspect that some other genomes are incorrectly labelled in NCBI and end up in the other clusters.
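For anyone who wants to repeat this kind of cross-reference, here is a minimal sketch. It assumes you already have a table of (NCBI tax_id, specI cluster) pairs for the genomes; the rows and the cluster names other than SpecI_95 are made up for illustration, and `clusters_for_species` is a hypothetical helper, not part of mOTUs.

```python
from collections import Counter


def clusters_for_species(genome_table, species_tax_ids):
    """Count genomes per cluster, keeping only genomes whose tax_id is in the species set."""
    return Counter(
        cluster for tax_id, cluster in genome_table if tax_id in species_tax_ids
    )


# Hypothetical rows: (NCBI tax_id, specI cluster). Only SpecI_95 is a real name
# taken from the comment above; cluster_A..cluster_D are placeholders.
genome_table = [
    (562, "SpecI_95"),
    (562, "SpecI_95"),
    (83333, "SpecI_95"),
    (562, "cluster_A"),
    (562, "cluster_B"),
    (562, "cluster_C"),
    (1280, "cluster_D"),  # a different species, ignored by the filter
]

# Illustrative set of tax_ids associated with E. coli (species and one strain).
e_coli_tax_ids = {562, 83333}

counts = clusters_for_species(genome_table, e_coli_tax_ids)
primary = counts.most_common(1)[0][0]

for cluster, n in counts.most_common():
    print(cluster, n)
print("primary cluster:", primary)
```

Clusters that hold only one or two of the species' genomes are the ones worth checking for mislabelled NCBI entries.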
One solution can be to identify the most abundant ref-mOTUs in human gut and manually curate these.
Is this something that could be improved by complementing the current annotation with an NCBI taxonomy alternative? I'm thinking specifically about https://gtdb.ecogenomic.org/ (and https://github.com/Ecogenomics/GtdbTk) that tried to consolidate taxonomy and genetic distance, but wonder if there are similar efforts out there.
As I understand, the main source of inconsistency is the fact that SpecI are defined based on genetic distance and NCBI taxonomy doesn't necessarily follow the same principle. This might also help address the issues mentioned by @cmfield.
There are more inconsistencies with mOTUs 2.5 than with mOTUs 2.0 because there are more genomes within each mOTU. Some of the new genomes have a wrong annotation, and hence two corresponding mOTUs (from 2.0 and 2.5) might have different annotations.
Number of merge inconsistencies (like Rhodobacteraceae genus [Rhodobacter/Paenirhodobacter]):
Level | Number of inconsistent mOTUs | Percentage |
---|---|---|
Kingdom | 1 | 0.008 |
Phylum | 22 | 0.18 |
Class | 47 | 0.38 |
Order | 65 | 0.53 |
Family | 133 | 1.09 |
Genus | 272 | 2.22 |
Species | 702 | 5.74 |
I propose two possible solutions:
The numbers in the previous comment are an overestimate. Instead of 272 genus-level inconsistencies, there are 91. Here is a link to the annotation of the problematic ref-mOTUs (/specI): https://docs.google.com/spreadsheets/d/1vMBFckMLDLYipW-1Jzw2jYMMfkOl5Bkcp3G65gsfOug/edit?usp=sharing
For mOTUs 2.6.1 we change the taxonomy of 32 ref-mOTUs:
ref_mOTU_v25_00077
ref_mOTU_v25_00084
ref_mOTU_v25_00085
ref_mOTU_v25_00087
ref_mOTU_v25_00096
ref_mOTU_v25_00103
ref_mOTU_v25_00133
ref_mOTU_v25_00188
ref_mOTU_v25_00259
ref_mOTU_v25_00261
ref_mOTU_v25_00278
ref_mOTU_v25_00281
ref_mOTU_v25_00318
ref_mOTU_v25_00321
ref_mOTU_v25_00329
ref_mOTU_v25_00344
ref_mOTU_v25_00353
ref_mOTU_v25_00547
ref_mOTU_v25_00611
ref_mOTU_v25_00618
ref_mOTU_v25_00709
ref_mOTU_v25_00830
ref_mOTU_v25_00964
ref_mOTU_v25_01135
ref_mOTU_v25_01452
ref_mOTU_v25_01541
ref_mOTU_v25_01786
ref_mOTU_v25_01897
ref_mOTU_v25_02104
ref_mOTU_v25_02590
ref_mOTU_v25_02801
ref_mOTU_v25_00095
Diff between the taxonomies:
79c79
< Bacteria sp. [ref_mOTU_v25_00077]
---
> Enterobacter sp. [ref_mOTU_v25_00077]
86,87c86,87
< Enterobacteriaceae sp. [ref_mOTU_v25_00084]
< Enterobacterales sp. [ref_mOTU_v25_00085]
---
> Klebsiella aerogenes [ref_mOTU_v25_00084]
> Klebsiella pneumoniae [ref_mOTU_v25_00085]
89c89
< Enterobacteriaceae sp. [ref_mOTU_v25_00087]
---
> Raoultella sp. [ref_mOTU_v25_00087]
97,98c97,98
< Proteobacteria sp. [ref_mOTU_v25_00095]
< Enterobacteriaceae sp. [ref_mOTU_v25_00096]
---
> Escherichia coli [ref_mOTU_v25_00095]
> Citrobacter sp. [ref_mOTU_v25_00096]
105c105
< Gammaproteobacteria sp. [ref_mOTU_v25_00103]
---
> Citrobacter sp. [ref_mOTU_v25_00103]
135c135
< Proteobacteria sp. [ref_mOTU_v25_00133]
---
> Pseudomonas sp. [ref_mOTU_v25_00133]
189c189
< Gammaproteobacteria sp. [ref_mOTU_v25_00188]
---
> Pseudomonas sp. [ref_mOTU_v25_00188]
260c260
< Bacteria sp. [ref_mOTU_v25_00259]
---
> Acinetobacter baumannii [ref_mOTU_v25_00259]
262c262
< Gammaproteobacteria sp. [ref_mOTU_v25_00261]
---
> Acinetobacter pittii [ref_mOTU_v25_00261]
279c279
< Bacteria sp. [ref_mOTU_v25_00278]
---
> Bacillus subtilis [ref_mOTU_v25_00278]
282c282
< Bacillales sp. [ref_mOTU_v25_00281]
---
> Bacillus sp. [ref_mOTU_v25_00281]
319c319
< Bacteria sp. [ref_mOTU_v25_00318]
---
> Enterococcus faecalis [ref_mOTU_v25_00318]
322c322
< Bacteria sp. [ref_mOTU_v25_00321]
---
> Enterococcus faecium [ref_mOTU_v25_00321]
330c330
< Bacilli sp. [ref_mOTU_v25_00329]
---
> Bacillus sp. [ref_mOTU_v25_00329]
345c345
< Bacilli sp. [ref_mOTU_v25_00344]
---
> Staphylococcus sp. [ref_mOTU_v25_00344]
353c353
< Mycobacteriaceae sp. [ref_mOTU_v25_00353]
---
> Mycobacteroides abscessus [ref_mOTU_v25_00353]
545c545
< Alcaligenaceae sp. [ref_mOTU_v25_00547]
---
> Achromobacter sp. [ref_mOTU_v25_00547]
608c608
< Streptomycetaceae sp. [ref_mOTU_v25_00611]
---
> Streptomyces sp. [ref_mOTU_v25_00611]
615c615
< Actinobacteria sp. [ref_mOTU_v25_00618]
---
> Streptomyces sp. [ref_mOTU_v25_00618]
706c706
< Hafniaceae sp. [ref_mOTU_v25_00709]
---
> Hafnia alvei [ref_mOTU_v25_00709]
827c827
< Enterobacterales sp. [ref_mOTU_v25_00830]
---
> Serratia marcescens [ref_mOTU_v25_00830]
960c960
< Bacteria sp. [ref_mOTU_v25_00964]
---
> Micrococcus sp. [ref_mOTU_v25_00964]
1129c1129
< Bacteria sp. [ref_mOTU_v25_01135]
---
> Methylobacterium sp. [ref_mOTU_v25_01135]
1442c1442
< Streptomycetaceae sp. [ref_mOTU_v25_01452]
---
> Streptomyces sp. [ref_mOTU_v25_01452]
1530c1530
< Alcaligenaceae sp. [ref_mOTU_v25_01541]
---
> Alcaligenes faecalis [ref_mOTU_v25_01541]
1775c1775
< Stenotrophomonas sp. [ref_mOTU_v25_01786]
---
> Stenotrophomonas maltophilia [ref_mOTU_v25_01786]
1885c1885
< Streptomycetaceae sp. [ref_mOTU_v25_01897]
---
> Streptomyces sp. [ref_mOTU_v25_01897]
2092c2092
< Bacteria sp. [ref_mOTU_v25_02104]
---
> Aerococcus sp. [ref_mOTU_v25_02104]
2573c2573
< Bacillaceae sp. [ref_mOTU_v25_02590]
---
> Parageobacillus thermoglucosidasius [ref_mOTU_v25_02590]
2781c2781
< Bacilli sp. [ref_mOTU_v25_02801]
---
> Staphylococcus warneri [ref_mOTU_v25_02801]
If a ref-mOTU contains at least 10 genomes and at least 80% of those agree, then we select the taxonomy annotation of the agreeing 80%.
This change allows E. coli to appear in the profiles: Proteobacteria sp. [ref_mOTU_v25_00095] becomes Escherichia coli [ref_mOTU_v25_00095]
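As a rough illustration of the 10-genome / 80% rule, here is a minimal sketch. It is not the actual mOTUs code: it assumes each genome carries a single annotation string, that "agree" means an identical string, and that the old LCA-style label is kept as a fallback when the rule does not apply. The function name and the example genome counts are made up.

```python
from collections import Counter

MIN_GENOMES = 10      # threshold stated in the rule
MIN_AGREEMENT = 0.80  # threshold stated in the rule


def consensus_annotation(annotations, fallback):
    """Return the majority annotation if the rule applies, otherwise the fallback label."""
    total = len(annotations)
    if total < MIN_GENOMES:
        return fallback
    name, count = Counter(annotations).most_common(1)[0]
    if count / total >= MIN_AGREEMENT:
        return name
    return fallback


# Hypothetical cluster: 18 of 20 genomes agree (90%), so the majority label wins.
mostly_e_coli = ["Escherichia coli"] * 18 + ["Shigella sonnei", "Bacteroides sp."]
relabelled = consensus_annotation(mostly_e_coli, "Proteobacteria sp.")

# Hypothetical cluster: only 7 of 10 agree (70%), so the fallback is kept.
split_cluster = ["Genus A sp."] * 7 + ["Genus B sp."] * 3
kept_split = consensus_annotation(split_cluster, "Bacteria sp.")

# Hypothetical cluster: all genomes agree, but there are fewer than 10 of them.
small_cluster = ["Hafnia alvei"] * 9
kept_small = consensus_annotation(small_cluster, "Hafniaceae sp.")

print(relabelled)
print(kept_split)
print(kept_small)
```

With these thresholds a couple of mislabelled genomes no longer drag a well-sampled cluster up to a high-level label, while small or genuinely split clusters keep their previous annotation.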
Thanks for the great tool!
I've noticed that a mOTUs cluster that had a species name in v2.0 now has a domain-level name in v2.5: Enterococcus faecalis [ref_mOTU_v2_0116] in motus2.0 has become Bacteria sp. [ref_mOTU_v25_00318] in motus2.5.
Through discussion with Alessio, he explained that this is due to the big increase in the number of references in v2.5. Because of this increase, the cluster now contains genomes that have NCBI taxon annotations in different phyla. This makes the lowest common ancestor for the cluster = Bacteria. This has happened to a few species, including E. coli.
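To make the effect concrete, here is a minimal sketch of a lowest-common-ancestor calculation over root-to-leaf lineages. It only illustrates the idea and is not the mOTUs implementation; the function name is made up, and the cluster contents are hypothetical (two genomes annotated as Enterococcus faecalis plus one stray genome annotated in another phylum).

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages (each a root-to-leaf list)."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        shared.append(ranks[0])
    return shared[-1] if shared else None


faecalis = [
    "Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
    "Enterococcaceae", "Enterococcus", "Enterococcus faecalis",
]
stray = [
    "Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
    "Enterobacteriaceae", "Escherichia", "Escherichia coli",
]

# A cluster whose genomes all agree keeps the species name.
lca_clean = lowest_common_ancestor([faecalis, faecalis])

# One genome annotated in a different phylum pushes the LCA up to the domain.
lca_mixed = lowest_common_ancestor([faecalis, faecalis, stray])

print(lca_clean)
print(f"{lca_mixed} sp.")
```

A single genome with a conflicting annotation is enough to collapse the label, no matter how many other genomes agree.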
The new taxonomic annotations are correct and make sense technically, but they might be misinterpreted, and they make the results harder to put in the context of previous studies (i.e. decades of studies using NCBI taxonomy).
To me there are three cases:
1) no good taxonomic annotations exist for members of the specI cluster
2) good annotations exist but NCBI taxonomy conflicts with specI clustering (e.g. half cluster is from genus 1, half is from genus 2)
3) good annotations exist and NCBI taxonomy mostly agrees with specI clustering except for 1 or 2 exceptions (cluster is ‘contaminated’ with poorly annotated genomes)
In case 1, I would expect to see "Bacteria sp."
A complex solution would try to distinguish cases 2 & 3, but this isn’t simple so it’s understandable (though not ideal :) ) to leave it to the user. For me, the important thing is to distinguish “we have no idea what this species is” from “we have a pretty good idea, but it’s complicated” (i.e. distinguishing case 1 vs 2|3)
To me, "Bacteria sp." immediately says "completely unknown bacteria", so I think it might be better to avoid this. (Even though using it makes sense when you think about the method.)
Taking another example: Bacteria sp. [ref_mOTU_v25_00077] is also composed of many Enterobacter sp., including E. coli. Bacteria sp. is the correct LCA, but might not be the most useful label for this cluster. In this case, perhaps something like:
mix Proteobacteria/Bacteroidetes [ref_mOTU_v25_00077]
with the “mix” prefix leading the user to look at the taxonomy table more closely. Or:
mixed NCBI taxa in Bacteria [ref_mOTU_v25_00077]
Or:
mixed - mostly Enterobacter sp. [...]
Or:
mixed - similar to E. coli […]
or.... ?
The last option there would require manual curation of cluster taxonomic classifications, which comes with its own problems of course. However, as a user, if I see a lot of “Bacteria sp.” (especially for common taxa) it makes it very hard to put new results into the context of decades of work that have used the NCBI taxonomy.
The way you have it now (i.e. "LCA sp.") makes sense and is correct; it’s just a bit hard to use in practice, so I thought I would mention my experience. Anyway, these are just my two cents as a user.