Closed Askarbek-orakov closed 3 years ago
You are formatting the queries correctly. The s__
indicates that there is no NCBI species-level taxonomy for that GTDB species (just coarser taxonomic classifications). You can query the GTDB metadata file (eg., grep s__CAG-485 sp900556595
) to confirm that.
In such case reporting of the available coarser taxonomic classification is needed. I guess currently this script cannot do that, right?
Currently no, but it would likely be just a minor change
Okay, Thanks!
Currently no, but it would likely be just a minor change
Would you have time to do it? I want to do this in scale and tried to remove trailing empty levels from NCBI taxonomy. It helps in many cases but sometimes results in such NCBI_taxonomy output s__GB_GCA_003265975.1
when there is only one hit in the GTDB metadata table. I could script it to get the lowest level available NCBI taxonomy for this assembly ID but not sure if it would preserve the intended behaviour. What is your suggestion? Is simply removing the trailing empty levels a decent solution?
OK. The script will now not allow the resulting taxonomic classification to be ^[pcofgs]__$
. Let me know if that doesn't fix your issue.
The format is right, but it seems like there is some bug now since querying "gAquabacter" results in "gBacillus", 0.857 and "genus". While relevant hits have following ncbi taxonomy:
grep "g__Aquabacter" GTDB_metadata_r95.tsv | cut -f79
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Xanthobacteraceae;g__Azorhizobium;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__Aquabacterium parvum
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__Aquabacterium olei
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Hyphomicrobiaceae;g__Aquabacter;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Hyphomicrobiaceae;g__Aquabacter;s__Aquabacter spiritensis
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__Aquabacterium commune
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__Comamonadaceae;g__;s__
Also, somehow it seems to be biased towards exactly "g__Bacillus" in general since more than half of 17K queries assign to it.
Thank you a lot again!
Thank you for pointing this out! There was actually a serious bug in how the graphs were created, which caused many taxonomic classification errors. I've (hopefully) fixed the bug and have warned users about the bug in prior versions (see the README).
OK. The script will now not allow the resulting taxonomic classification to be
^[pcofgs]__$
. Let me know if that doesn't fix your issue.
This fix doesn't seem to work anymore. Example query "fBM003" returns "sGB_GCA_002868855.1".
Thanks for fixing that bug above!
OK, that should be fixed. 2 notes:
--fraction
(now set to 0.9), given the fixed algorithm
Hi @nick-youngblut & @shenwei356,
Thanks for your script and some of my time saved! I am trying to infer closest ncbi taxids (at any level) from GTDB taxonomy.
Just to check I am doing it right. I take "classification" column from gtdbtk.*.summary.tsv output table by GTDB-Tk and remove trailing empty levels such as ";s" from ends of classifications and take the rightmost level clade assignment as the query. Example, "sCAG-485 sp900556595".
However, I then get some results with empty levels. Example:
1) Do I just need to edit the "ncbi_taxonomy" column in the GTDB metadata file to remove trailing empty levels to get proper taxon level and names? 2) If my terminal aim is to get NCBI taxids for the lowest (closer to species) level possible, is it possible to do it right away in your script?
Thank you!