Closed andrewjmc closed 1 year ago
What do you mean by "original GTDB taxids"? The GTDB doesn't have taxids, AFAIK. The GTDB metadata tables only contain NCBI taxids.
That's a great question, and it embarrassingly demonstrates that my question was ill-informed! I assumed GTDB had to have its own stable IDs, since NCBI genera, species etc. can be split (e.g. Streptococcus mitis is split into >30 species with A-Z, AA-AZ nomenclature). However, I can't see any such IDs on the website: https://gtdb.ecogenomic.org/genomes?gid=GCF_900411395.1. The IDs I have in my custom kraken database are all arbitrary too, I think.
Actually, the DADA2 database (https://zenodo.org/record/4735821) only links to species representative genomes, e.g.
>Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus_mitis_AL(RS_GCF_000960005.1)
So as long as I can link from RS_GCF_000960005.1 to the generated taxid for Streptococcus_mitis_AL, I should be fine.
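For illustration, here is a minimal sketch of that linking step. It assumes every header in the DADA2 fasta follows the pattern of the example above (semicolon-separated lineage, with the genome ID in parentheses after the species name). The function names and the `name_to_taxid` lookup are hypothetical: the lookup would have to be built from whatever names.dmp is generated, and its keys must match however the species names are spelled there. The output header uses the `kraken:taxid|<id>` convention that kraken2 accepts for custom library sequences.

```python
import re

# Final field of a DADA2-style GTDB header, e.g.
# "Streptococcus_mitis_AL(RS_GCF_000960005.1)"
_SPECIES_FIELD = re.compile(r"^(?P<species>[^()]+)\((?P<genome>[^()]+)\)$")


def parse_dada2_header(header):
    """Split a DADA2 GTDB header into (lineage, species, genome_id)."""
    fields = header.lstrip(">").strip().split(";")
    match = _SPECIES_FIELD.match(fields[-1])
    if match is None:
        raise ValueError(f"no genome ID found in header: {header!r}")
    # Everything before the last field is the higher-rank lineage.
    return fields[:-1], match.group("species"), match.group("genome")


def kraken2_header(header, name_to_taxid):
    """Rewrite a DADA2 header as a kraken2 library header.

    name_to_taxid is a hypothetical dict mapping species name -> taxid,
    built from the generated names.dmp.
    """
    _, species, genome = parse_dada2_header(header)
    taxid = name_to_taxid[species]
    return f">{genome}|kraken:taxid|{taxid} {species}"
```

Called on the header above with a lookup containing `Streptococcus_mitis_AL`, `kraken2_header` returns a header that starts with `>RS_GCF_000960005.1|kraken:taxid|` followed by whatever taxid the lookup holds.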
It would be great if the GTDB did create stable taxids, but currently, they do not. My script in this repo can create taxids, but I don't try to keep them stable across releases.
Why do you want a kraken2 16S database? I'm guessing that kraken2 wouldn't be very accurate when just using one (highly conserved) locus. Why not use a standard approach like the Qiime2 taxonomic classifier? You can always use the ncbi-gtdb_map.py script in this repo to map taxonomic classifications post-hoc.
Long story... we have shotgun metagenomic data, and want to estimate the carriage proportion of antibiotic resistance genes within genera. We have kraken RAs based on full GTDB sequences, but these are obviously genome-size dependent, so they don't give a measure that corresponds directly to the depth of coverage of a genus' antibiotic resistance genes. Kraken2 is blazingly fast and easy to run, and we have already used it to taxonomically assign contigs, so it's preferable to use the same database and tool if possible. There is also supporting data for using kraken2 even on targeted 16S: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00900-2#Sec18
But if I can't make it fly, we may fall back on other approaches! Munging taxonomies is rarely easy, hence the value of tools like yours.
So you want to do this via a profiling approach instead of an assembly (genome or genes) approach? If you have the AMR genes, you can just map reads to the genes via DIAMOND and estimate abundances.
Using persistent TaxIds could be a good idea to track the changes across different versions of GTDB. I think it is feasible because GTDB uses the NCBI assembly accession as the genome Id.
# assembly accessions are stable in NCBI
GCA_000016605.1 -> GCA_000016605
GCF_000011005.1 -> GCF_000011005
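As a rough sketch of the mapping shown above, the version suffix can be stripped with a small helper. This is only an illustration of the idea, not how any particular tool does it; the function name is hypothetical, and it assumes accessions look like `GCA_`/`GCF_` followed by nine digits, optionally carrying the `RS_` or `GB_` prefix that GTDB puts in front of genome IDs.

```python
import re

# GCA_/GCF_ + nine digits, with an optional GTDB prefix (RS_ or GB_)
# and an optional ".<version>" suffix.
_ACCESSION = re.compile(r"^(?:RS_|GB_)?(GC[AF]_\d{9})(?:\.\d+)?$")


def unversioned_accession(accession):
    """Return the assembly accession without GTDB prefix or version suffix."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    return match.group(1)
```

The unversioned accession stays the same when an assembly is updated to a new version, which is what makes it usable as a persistent key across releases.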
Closing. For anyone interested in persistent taxids, see https://github.com/shenwei356/gtdb-taxdump
Hi Nick,
Thanks for this tool. Is there any way to preserve the original GTDB taxids in the dmp files produced?
There is no kraken2 GTDB 16S database. However, there is a DADA2-formatted database. If I could get nodes.dmp and names.dmp with GTDB taxids, it should be possible to munge the DADA2 fasta into a kraken2 fasta!
Alternatively, does the tool produce a lookup table of GTDB ID to new ID?
Thanks,
Andrew