Preserve IDs - Githubissues

andrewjmc commented 3 years ago

Hi Nick,

Thanks for this tool. Is there any way to preserve the original GTDB taxids in the dmp files produced?

There is no kraken2 GTDB 16S database. However, there is a DADA2-formatted database. If I could get nodes.dmp and names.dmp with GTDB taxids, it should be possible to munge the DADA2 FASTA into a kraken2 FASTA!

Alternatively, does the tool produce a lookup table of GTDB ID to new ID?

Thanks,

Andrew

nick-youngblut commented 3 years ago

What do you mean by "original GTDB taxids"? The GTDB doesn't have taxids, AFAIK. The GTDB metadata tables only contain NCBI taxids.

andrewjmc commented 3 years ago

That's a great question, and it embarrassingly demonstrates that my question was ill-informed! I assumed GTDB had to have its own stable IDs, since NCBI genera, species etc. can be split (e.g. Streptococcus mitis is split into >30 species with A-Z, AA-AZ nomenclature). However, I can't find any such IDs on the website: https://gtdb.ecogenomic.org/genomes?gid=GCF_900411395.1. The IDs I have in my custom kraken database are all arbitrary too, I think.

Actually, the DADA2 database (https://zenodo.org/record/4735821) only links to species representative genomes, e.g.

>Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus_mitis_AL(RS_GCF_000960005.1)

So as long as I can link from RS_GCF_000960005.1 to the generated taxid for Streptococcus_mitis_AL, I should be fine.
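The linking step described above could be sketched roughly as follows in Python. This is a minimal sketch, not part of gtdb_to_taxdump: the regular expression assumes every DADA2 header ends in a parenthesised `RS_`/`GB_` genome accession like the example above, the function names are hypothetical, and the taxid passed to `kraken2_header` is a placeholder that would have to come from the generated names.dmp. The `kraken:taxid|` suffix follows the header convention kraken2 uses for custom database sequences.

```python
import re

# Assumed header layout (taken from the single example in this thread):
#   >Domain;Phylum;Class;Order;Family;Genus;Species(RS_GCF_000960005.1)
HEADER_RE = re.compile(
    r"^>?(?P<lineage>.+)\((?P<genome>(?:RS|GB)_GC[AF]_\d+\.\d+)\)$"
)


def parse_dada2_header(header):
    """Split a DADA2-style GTDB header into lineage, species and genome ID."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"Unrecognised header: {header!r}")
    ranks = match.group("lineage").split(";")
    return {
        "lineage": ranks,
        "species": ranks[-1],
        "genome": match.group("genome"),
    }


def kraken2_header(seq_id, taxid):
    """Build a FASTA header carrying an explicit taxid for kraken2."""
    return f">{seq_id}|kraken:taxid|{taxid}"


example = (
    ">Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;"
    "Streptococcus;Streptococcus_mitis_AL(RS_GCF_000960005.1)"
)
parsed = parse_dada2_header(example)
# 12345 is a placeholder taxid, not a real generated ID.
new_header = kraken2_header(parsed["genome"], 12345)
```

The species name (or the genome accession) from `parsed` would then be the key for looking up the generated taxid.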

nick-youngblut commented 3 years ago

It would be great if the GTDB did create stable taxids, but currently, they do not. My script in this repo can create taxids, but I don't try to keep them stable across releases.

Why do you want a kraken2 16S database? I'm guessing that kraken2 wouldn't be very accurate when just using one (highly conserved) locus. Why not use a standard approach like the QIIME 2 taxonomic classifier? You can always use the ncbi-gtdb_map.py script in this repo to map taxonomic classifications post hoc.

andrewjmc commented 3 years ago

Long story... we have shotgun metagenomic data, and want to estimate the carriage proportion of antibiotic resistance genes within genera. We have kraken relative abundances (RAs) based on full GTDB sequences, but these obviously depend on genome size, so they are not directly comparable to the depth of coverage of a genus's antibiotic resistance genes. Kraken2 is blazingly fast and easy to run, and we have already used it to taxonomically assign contigs, so it's preferable to use the same database and tool if possible. There is also supporting data for using kraken2 even on targeted 16S: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00900-2#Sec18

But if I can't make it fly, we may fall back on other approaches! Munging taxonomies is rarely easy, hence the value of tools like yours.

nick-youngblut commented 3 years ago

So you want to do this via a profiling approach instead of an assembly-based (genome or gene) approach? If you have the AMR genes, you can just map reads to the genes via DIAMOND and estimate abundances.

shenwei356 commented 2 years ago

Using persistent TaxIds could be a good idea to track the changes across different versions of GTDB. I think it is feasible because GTDB uses the NCBI assembly accession as the genome Id.

# assembly accessions are stable in NCBI
GCA_000016605.1 -> GCA_000016605
GCF_000011005.1 -> GCF_000011005
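The version-stripping shown above can be written as a small Python helper. This is only a sketch of that one step, under the assumption that accessions look like `GCA_`/`GCF_` plus digits and an optional `.N` version, optionally prefixed with GTDB's `RS_`/`GB_`; the function name is hypothetical, and nothing here says how a persistent TaxId would actually be derived from the result.

```python
import re

# Optional GTDB prefix, the stable accession core, then an optional version.
ACCESSION_RE = re.compile(r"^(?:(?:RS|GB)_)?(GC[AF]_\d+)(?:\.\d+)?$")


def unversioned_accession(accession):
    """Return the accession without GTDB prefix or version suffix."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"Not an assembly accession: {accession!r}")
    return match.group(1)


stable_ids = [
    unversioned_accession(acc)
    for acc in ("GCA_000016605.1", "GCF_000011005.1", "RS_GCF_000960005.1")
]
```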

nick-youngblut commented 1 year ago

Closing. For anyone interested in persistent taxids, see https://github.com/shenwei356/gtdb-taxdump

nick-youngblut / gtdb_to_taxdump

Preserve IDs #8