jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Unclassified terminology #590

Closed: analopezlopez closed this issue 1 year ago

analopezlopez commented 1 year ago

Hi,

I have a quick question about the nomenclature of unclassified taxa. Could you please tell me what the essential difference is between bacteria named "Unclassified Taxa" and those named "Taxa no (genus/family/order) in NCBI"? For example, I have some reads assigned to Unclassified Lachnospiraceae and others to Lachnospiraceae no genus in NCBI, and I'm not sure why they are sorted that way. Does it depend on the database used, maybe?

Thank you for your time, A

fpusan commented 1 year ago

Hi!

1) Unclassified taxa (e.g. "Unclassified Lachnospiraceae") are reads/ORFs/contigs that could not be classified at the requested level, but could be classified at a higher level. E.g. in a genus-level taxonomy, "Unclassified Lachnospiraceae" means reads that could not be classified at the genus level, either because of a lack of consensus between the best hits in the LCA algorithm, or because the similarity to the best hits in the NCBI database was too low. However, they could be classified at the family level, to the "Lachnospiraceae" family. So we report them as "Unclassified Lachnospiraceae", since this is still more informative than reporting them as "Unclassified".

2) "Lachnospiraceae (no genus in NCBI)" means that there was actually consensus between the best hits in the LCA algorithm, and that the similarity was good enough to warrant a genus-level classification. However, the best hits from the NCBI database were themselves not classified at the genus level, so we can't really derive a genus-level classification from them. So we report the taxonomy as "Lachnospiraceae", since this is the best taxonomy we can derive from those references (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=1898203 would be an example of such a reference).

So in the first case, you didn't get a genus-level classification because your sequences are too different from anything in the reference database, or because the best hits disagreed with each other. In the second case, your sequences actually have a good match in the database, but the database itself lacks the necessary taxonomic annotation. This distinction is probably irrelevant for most users, and I will consider unifying the terminology in the future.
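To make the two cases concrete, here is a toy sketch of the decision for a single read in a genus-level table. It is only an illustration under stated assumptions, not SqueezeMeta's actual LCA code: the function name `genus_label`, the `GENUS_MIN_IDENTITY` cutoff, the hit dictionaries, and the strict "all good hits must agree" rule are all hypothetical simplifications.

```python
from typing import Optional, Sequence, TypedDict


class Hit(TypedDict):
    identity: float        # percent identity of the hit to the query
    genus: Optional[str]   # None when the reference has no genus in NCBI


# Hypothetical identity cutoff for a genus-level call (not SqueezeMeta's real value).
GENUS_MIN_IDENTITY = 60.0


def genus_label(hits: Sequence[Hit], family: str) -> str:
    """Return a genus-level label for one read.

    Assumes every hit already agrees on `family`, so only the genus rank
    is in question.
    """
    good = [h for h in hits if h["identity"] >= GENUS_MIN_IDENTITY]
    if not good:
        # Case 1a: similarity to the best hits is too low for a genus call.
        return f"Unclassified {family}"

    genera = {h["genus"] for h in good}
    if len(genera) > 1:
        # Case 1b: the good hits disagree, so there is no consensus.
        return f"Unclassified {family}"

    genus = genera.pop()
    if genus is None:
        # Case 2: consistent, good hits, but the references themselves
        # carry no genus annotation in the NCBI taxonomy.
        return f"{family} (no genus in NCBI)"
    return genus


if __name__ == "__main__":
    fam = "Lachnospiraceae"
    print(genus_label([{"identity": 40.0, "genus": "Blautia"}], fam))
    print(genus_label([{"identity": 95.0, "genus": None}], fam))
```

In this sketch, a read whose only hit has low identity falls into the "Unclassified" branch, while a read with a strong hit to a reference lacking a genus falls into the "(no genus in NCBI)" branch. The real pipeline weighs the hits differently, so treat the threshold and the consensus rule as placeholders.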

analopezlopez commented 1 year ago

Understood. Thank you very much!