nick-youngblut / gtdb_to_taxdump

Convert GTDB taxonomy to NCBI taxdump format
MIT License

Question about ncbi-gtdb_map.py #15

Closed: andressamv closed this issue 1 year ago

andressamv commented 2 years ago

Hi! Thank you for this amazing tool! I am using ncbi-gtdb_map.py for the first time, and everything worked perfectly. However, I have a conceptual question. In some cases, the script returns an NCBI taxonomy that I didn't expect. For example:

GTDB: d__Bacteria; p__Eremiobacterota; c__Eremiobacteria; o__Baltobacterales; f__Baltobacteraceae; g__Aquilonibacter

I expected the NCBI taxonomy to be Candidatus Eremiobacteria, since this class exists in NCBI. Instead, the script returns Candidatus Eremiobacterota (phylum). I understand that this is related to the provided GTDB metadata. But how should I proceed when submitting my genomes to NCBI? What problems would arise from using Candidatus Eremiobacteria, for example?

I am having a hard time comparing the GTDB and NCBI taxonomies, so I appreciate any feedback on this.

nick-youngblut commented 2 years ago

"Candidatus Eremiobacterota" seems to be a legit NCBI taxonomic classification with an NCBI taxid: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=1154676&lvl=3&lin=f&keep=1&srchmode=1&unlock

What is the problem with using "Candidatus Eremiobacterota"?

andressamv commented 2 years ago

Thank you for your response! NCBI requires the use of a taxonomic name at the lowest rank that is reliable. I don't think it is a serious problem, so I am just trying to understand what to choose here.

nick-youngblut commented 2 years ago

I see: g__Aquilonibacter mapped to Candidatus Eremiobacterota

You can adjust the mapping threshold to make the assignment more permissive; you should then get a more fine-grained NCBI taxonomic classification.
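To show why a more permissive threshold can give a finer rank, here is a minimal sketch. It is not the code in ncbi-gtdb_map.py. It assumes a simple majority rule: walk down the NCBI ranks and stop once the most common name covers less than a given fraction of the genomes. The function name `map_by_majority`, the `fraction` parameter, and the genome counts are all hypothetical; check the script's `--help` output for the real option name and default.

```python
from collections import Counter


def map_by_majority(lineages, fraction=0.9):
    """Return the deepest taxon supported by at least `fraction` of lineages.

    Hypothetical helper for illustration only. Each lineage is a list of
    NCBI taxon names ordered from root to the lowest assigned rank.
    """
    best = "unclassified"
    total = len(lineages)
    current = list(lineages)
    depth = 0
    while current:
        # Count names at this depth among lineages that reach it.
        names = Counter(lin[depth] for lin in current if len(lin) > depth)
        if not names:
            break
        name, count = names.most_common(1)[0]
        # Support is measured against ALL genomes, so lineages that stop
        # at a higher rank count against the finer assignment.
        if count / total < fraction:
            break
        best = name
        current = [lin for lin in current if len(lin) > depth and lin[depth] == name]
        depth += 1
    return best


# Made-up data: 10 genomes in one GTDB genus. Seven carry the NCBI class,
# three are only classified to phylum in NCBI.
phylum_only = ["Bacteria", "Candidatus Eremiobacterota"]
with_class = phylum_only + ["Candidatus Eremiobacteria"]
lineages = [with_class] * 7 + [phylum_only] * 3

print(map_by_majority(lineages, fraction=0.9))  # Candidatus Eremiobacterota
print(map_by_majority(lineages, fraction=0.6))  # Candidatus Eremiobacteria
```

Under these assumptions, the strict threshold stops at the phylum because only 70% of the genomes carry the class name, while the looser threshold accepts the class. The trade-off is that a looser threshold also accepts names that fewer reference genomes agree on.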