leylabmpi / Struo2

Scalable creating/updating of metagenome profiling databases
MIT License
58 stars 8 forks source link

More detail on classification % and taxonomy issue for Kraken2 #31

Open dgolden96 opened 2 years ago

dgolden96 commented 2 years ago

Hi again! My supervisor and I have been troubleshooting our efforts to update the GTDB_release207 pre-made custom database with a selection of genomes from JGI GOLD (see previous issue #29). Based on Nick’s suggestion, I’m thinking of using the fastANI package to test the mean genomic distance between genomes in our TSV of genomes from the JGI GOLD and the genomes in GTDB_release207, but we’ve got a couple of reasons to think that we have a separate problem. We’re actually expecting relatively high ANI similarity because we’re focusing on expanding subspecies-level diversity in the database rather than adding new lineages, but out of the ~4,000 new genomes we’ve added, there were none that were identified with the read data we’re using. Basically, after adding our new genomes, we’re only getting differences in which reads are classified with which database genomes, rather than having any reads classified with our newly added genomes. We know the pipeline and our input TSV for the update should be working properly, since we tested it out by using that JGI input to update the GTDBr95_n10 toy database and saw an increase in classification percentage of read data (from small fractions of 1% to a range of values between 5% and 10% depending on the read data we were classifying). Naturally, for that testing process, we are seeing reads classified with the added JGI taxa, which is encouraging, but we’re still seeing some things we can’t really make heads or tails of. For instance, when we analyze the Kraken2 and Bracken output and report files for Kraken2 and Bracken calls on either the test (toy + JGI) or the experimental (GTDB + JGI) databases, we see that most of the rows in those files refer to tax IDs that don’t exist in the database. In some cases, I think this is due to the tax IDs being hashed as Struo2 adds the new genomes to the database, but I’m not sure why this would be the case for some but not others. The main thing we’re seeking advice about then is how we can go about linking these unidentified tax IDs with their source taxonomies, and how we can make sure that we don’t have an issue with the pre-made GTDB_release207 taxonomy not merging properly with the taxonomies in our input TSV. I can send over the input TSV we've been using (as well as any other relevant files) as needed. Thanks!

Also, as a side note, we’re wondering if you have a copy of the version of the sample TSV that was used to generate the pre-made GTDB_release207 database. It seems like there are more total genomes in the ar53_metadata_r207.tsv and bac120_metadata_r207.tsv files than in the pre-made database itself, so we’re thinking those files were combined and filtered to generate the original TSV.

nick-youngblut commented 2 years ago

Thanks for the detailed investigation and description of the problem!

For your new ~4k genomes, are you using the taxids from https://github.com/shenwei356/gtdb-taxdump?

Also, as a side note, we’re wondering if you have a copy of the version of the sample TSV that was used to generate the pre-made GTDB_release207 database.

The genomes used for the Struo2 GTDB-207 custom databases can be found at: http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release207/metadata/

It seems like there are more total genomes in the ar53_metadata_r207.tsv and bac120_metadata_r207.tsv files than in the pre-made database itself

As stated in the Struo2 paper, we filtered the GTDB genomes by quality and selected only one representative per species

dgolden96 commented 2 years ago

For your new ~4k genomes, are you using the taxids from https://github.com/shenwei356/gtdb-taxdump?

We've just been using the NCBI tax IDs provided for each species in the JGI download data. If we need to use the hashed tax IDs in our input TSV though, that would definitely explain the incompatibilities we've been hypothesizing exist between our input genomes' taxonomy data and the taxonomy data in the pre-made database! I'll take a detailed look at https://github.com/shenwei356/gtdb-taxdump this evening.

The genomes used for the Struo2 GTDB-207 custom databases can be found at: http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release207/metadata/

As stated in the Struo2 paper, we filtered the GTDB genomes by quality and selected only one representative per species

Sorry, I think my question wasn't as clear as it should have been; we know that the GTDB genomes are filtered to include one high quality representative per species in the pre-made database. My question was more like "is the filtered version of those metadata files, used as the sample TSV to run db-create to generate the pre-made database, available for download?" Having said that, now that you've pointed out the reference file at http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release207/metadata/ , I figure I may be able to use that to filter the overall GTDB metadata files with a little bit of data wrangling. Thanks! I'll let you know how things go, much appreciated.

nick-youngblut commented 2 years ago

I hope that switching to the gtdb-taxdump taxids solves your issue.

The metadata at http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release207/metadata/ is basically what was used for running the db-create portion of the Struo2 pipeline. You would need to download all of the genomes and provide a column that includes the paths to the genomes.

dgolden96 commented 2 years ago

Would you happen to know if there's any additional documentation from gtdb-taxdump regarding how the tax IDs are generated? My experience with hashing is a little bit limited, and the repo's readme file doesn't shed all that much light on the process. Any help would be appreciated!

nick-youngblut commented 2 years ago

It would be best to ask @shenwei356 directly about how the taxids are generated

dgolden96 commented 2 years ago

Will do! Thanks!

dgolden96 commented 2 years ago

Hi again! We've made some progress, but we're running into some additional issues. We have successfully generated the hashed GTDB tax IDs for our input genomes, but attempting the database update using these along with the NCBI taxonomy data for our input genomes yielded the same results that we got using the original NCBI tax IDs. We looked more closely at our NCBI taxonomy data and found that there were some differences in the nomenclature of various taxa due to a recent update in how taxa are named by researchers. We're attempting to use this script ncbi-gtdb_map.py to generate GTDB taxonomy data for our genomes, and we have managed to run the script successfully, but the script generates an output TSV file for which all values in the gtdb_taxonomy column are "unclassified." We've attempted to use both taxonomy and tax ID queries with the same result. Also, as a sanity check, we ran the script using the repo's provided test data, and we found that this also yielded exclusively "unclassified" gtdb_taxonomy data. Would you have any advice on possible ways of resolving this issue of converting NCBI tax IDs and/or NCBI taxonomy data to GTDB taxonomy data? Thanks!

nick-youngblut commented 2 years ago

We looked more closely at our NCBI taxonomy data and found that there were some differences in the nomenclature of various taxa due to a recent update in how taxa are named by researchers

Why not just use taxonomy based on GTDB-Tk classifications?

In regards to ncbi-gtdb_map.py, can you please provide more information on how you ran the code.