Using different starting NCBI taxonomy gives different results

Jigyasa3 commented 3 years ago

Hey @shenwei356 and @nick-youngblut

Thank you so much for writing the python scripts to convert taxonomy between NCBI and GTDB. It will save so much time as compared to redoing the analysis using the new database!

I do have a query. When I use the NCBI taxonomy of a protein at given species or genus level, I get results that find LCA at genus and species level using ncbi-gtdb_map.py. eg- ncbi_taxonomy gtdb_taxonomy lca_frac target_tax_level dEukaryota unclassified NA NA sGordonibacter pamelaeae sGordonibacter pamelaeae 1.0 species sCohaesibacter haloalkalitolerans sCohaesibacter haloalkalitolerans 1.0 species sTreponema primitia g__Treponema_E 1.0 genus sPrevotella bergensis sPrevotella bergensis 1.0 species sTreponema socranskii sTreponema_D socranskii 0.667 species

But when I group my NCBI taxonomy of a protein at family level, then ncbi-gtdb_map.py gives me mainly domain level or unclassified taxonomy.

eg- Same protein, but taking family level taxonomy as compared to species or genus level one. ncbi_taxonomy gtdb_taxonomy lca_frac target_tax_level fEggerthellaceae dBacteria 0.95 domain fCohaesibacteraceae dBacteria 0.94 domain fSpirochaetaceae dBacteria 0.92 domain fPrevotellaceae dBacteria 0.96 domain fPorphyromonadaceae dBacteria 0.92 domain fAnaplasmataceae dBacteria 0.93 domain

As the ncbi-gtdb_map.py finds the LCA of a given NCBI taxonomy in GTDB, does this result mean that family level annotation in GTDB is more stringent or there are less representatives giving the final LCA at domain level? Would you recommend that I use the species/genus level NCBI taxonomy to convert to GTDB?

Happy holidays! Please reply after the holidays, there is no rush!

nick-youngblut commented 3 years ago

Thanks for your interest in ncbi-gtdb_map.py! I hope that you find it to be useful.

I've found the same: most taxonomic classifications at coarser levels than genus/species will just map to the Bacteria domain, unless you reduce the LCA stringency score (--fraction). This suggest that the coarser-level taxonomies often do not map "cleanly" between NCBI and GTDB. You can also more stringently filter the genomes used for NCBI-GTDB taxonomy mapping (--completeness & --contamination) if you suspect that some poorly assembled genomes are taxonomically mis-classified.

Jigyasa3 commented 3 years ago

Thank you so much for a prompt reply @nick-youngblut!

I will play with the LCA stringency score and completeness and contamination cut-off. I wanted to ask is redoing the taxonomic annotation with GTDB database more accurate than converting the NCBI annotation to GTDB annotation?

nick-youngblut commented 3 years ago

is redoing the taxonomic annotation with GTDB database more accurate than converting the NCBI annotation to GTDB annotation?

I haven't tested that. I guess it would depend on the accuracy of you NCBI taxonomic annotations.

Jigyasa3 commented 3 years ago

Thanks! I am testing it on a gut metagenome sample, will also let you know if there are huge differences or not. Thanks again for all the help!

Jigyasa3 commented 3 years ago

Hey @nick-youngblut

I compared the ncbi-gtdb_map.py script (hereafter called as method1) to taxonomy annotated via gtdb_to_diamond.py (hereafter called as method2) generated all_gtdb.faa file as a database. The results are very different!

method1 results- gtdbdomain n 1 dArchaea 178 2 dBacteria 2550

method2 results- bastadomain n 1 Bacteria 141 2 Eukaryota 2587

For method2, I am running diamond blastx of my shotgun sequenced metagenome contigs against gtdb.dmnd database, and then using BASTA https://github.com/timkahlke/BASTA/wiki to generate LCA from accession2taxid.tsv file. Why do you think there is such a huge difference in the taxonomy of the same file?

nick-youngblut commented 3 years ago

Is you dataset likely comprised mostly of Eukaryotes? Note that GTDB is only bacteria and archaea, so all reads from eukaryote genomes will either hit nothing in the GTDB or wrongly hit a prokaryotic genome.

Jigyasa3 commented 3 years ago

I went back to check the taxon ids assigned to gtdb accession numbers in accession2taxid.tsv generated via gtdb_to_diamond.py script. I think there is something wrong in the accession2taxid.tsv file.

For example, GCA001818315.1 accession is assigned to taxon id 164650 in the accession2taxid.tsv file. When I manually check the taxon id in the NCBI database, I get a Eukaryote hit https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=164650. In GTDB, this accession is dBacteria;pPatescibacteria;cABY1;oUBA9570;fHO2-48-17;gWO2-43-9;s__WO2-43-9 sp001818315

Similarly, GCA900321315.1 accession is assigned to taxon id 219029, which also gives a hit to a Eukaryote in NCBI https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=219029. In GTDB, this accession is dBacteria;pBacteroidota;cBacteroidia;oBacteroidales;fF082;gF082;s__F082 sp900321315

I only checked a couple of accessions that gave a Eukaryote hit in my metagenome. Any suggestions? Maybe the GTDB to NCBI taxon id file is incorrect? Looking forward to your reply!

nick-youngblut commented 3 years ago

The gtdb_to_diamond.py script just associates the protein sequence with the accessions via the file paths in the tarball (the accession is in the path for each protein fasta), so I don't see how that could be wrong. You can inspect the files in the tarball to see if something looks wrong.

GTDB to NCBI taxon id file is incorrect

That GTDB-NCBI mapping is just based on the GTDB metadata, so it should be correct, but I'm looking into making some small validation datasets to check that the algorithm that I'm using is working properly.

nick-youngblut / gtdb_to_taxdump

Using different starting NCBI taxonomy gives different results #2