nick-youngblut / gtdb_to_taxdump

Convert GTDB taxonomy to NCBI taxdump format
MIT License
65 stars 13 forks source link

GTDB to NCBI mapping results in empty taxa #4

Closed Askarbek-orakov closed 3 years ago

Askarbek-orakov commented 3 years ago

Hi @nick-youngblut & @shenwei356,

Thanks for your script and some of my time saved! I am trying to infer closest ncbi taxids (at any level) from GTDB taxonomy.

Just to check I am doing it right. I take "classification" column from gtdbtk.*.summary.tsv output table by GTDB-Tk and remove trailing empty levels such as ";s" from ends of classifications and take the rightmost level clade assignment as the query. Example, "sCAG-485 sp900556595".

However, I then get some results with empty levels. Example:

s__CAG-485 sp900556595  s__     1.0     species
s__CAG-485 sp002361155  s__     1.0     species
s__CAG-485 sp002362485  s__     1.0     species

1) Do I just need to edit the "ncbi_taxonomy" column in the GTDB metadata file to remove trailing empty levels to get proper taxon level and names? 2) If my terminal aim is to get NCBI taxids for the lowest (closer to species) level possible, is it possible to do it right away in your script?

Thank you!

nick-youngblut commented 3 years ago

You are formatting the queries correctly. The s__ indicates that there is no NCBI species-level taxonomy for that GTDB species (just coarser taxonomic classifications). You can query the GTDB metadata file (eg., grep s__CAG-485 sp900556595) to confirm that.

Askarbek-orakov commented 3 years ago

In such case reporting of the available coarser taxonomic classification is needed. I guess currently this script cannot do that, right?

nick-youngblut commented 3 years ago

Currently no, but it would likely be just a minor change

Askarbek-orakov commented 3 years ago

Okay, Thanks!

Askarbek-orakov commented 3 years ago

Currently no, but it would likely be just a minor change

Would you have time to do it? I want to do this in scale and tried to remove trailing empty levels from NCBI taxonomy. It helps in many cases but sometimes results in such NCBI_taxonomy output s__GB_GCA_003265975.1 when there is only one hit in the GTDB metadata table. I could script it to get the lowest level available NCBI taxonomy for this assembly ID but not sure if it would preserve the intended behaviour. What is your suggestion? Is simply removing the trailing empty levels a decent solution?

nick-youngblut commented 3 years ago

OK. The script will now not allow the resulting taxonomic classification to be ^[pcofgs]__$. Let me know if that doesn't fix your issue.

Askarbek-orakov commented 3 years ago

The format is right, but it seems like there is some bug now since querying "gAquabacter" results in "gBacillus", 0.857 and "genus". While relevant hits have following ncbi taxonomy:

grep "g__Aquabacter" GTDB_metadata_r95.tsv | cut -f79
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Xanthobacteraceae;g__Azorhizobium;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__Aquabacterium parvum
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__Aquabacterium olei
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Hyphomicrobiaceae;g__Aquabacter;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Hyphomicrobiaceae;g__Aquabacter;s__Aquabacter spiritensis
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__Aquabacterium commune
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__Aquabacterium;s__
d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__Comamonadaceae;g__;s__

Also, somehow it seems to be biased towards exactly "g__Bacillus" in general since more than half of 17K queries assign to it.

Thank you a lot again!

nick-youngblut commented 3 years ago

Thank you for pointing this out! There was actually a serious bug in how the graphs were created, which caused many taxonomic classification errors. I've (hopefully) fixed the bug and have warned users about the bug in prior versions (see the README).

Askarbek-orakov commented 3 years ago

OK. The script will now not allow the resulting taxonomic classification to be ^[pcofgs]__$. Let me know if that doesn't fix your issue.

This fix doesn't seem to work anymore. Example query "fBM003" returns "sGB_GCA_002868855.1".

Thanks for fixing that bug above!

nick-youngblut commented 3 years ago

OK, that should be fixed. 2 notes: