nick-youngblut / gtdb_to_taxdump

Convert GTDB taxonomy to NCBI taxdump format
MIT License

Add sequence accession to TAXID conversion tool #12

Closed maxibor closed 2 years ago

maxibor commented 2 years ago

One of the files often used by taxonomic classifiers is the accession2taxid file, which maps sequence accession numbers to TAXIDs.

I'm not yet super familiar with GTDB, so I might have missed it, but as far as I can see, GTDB only keeps track of accessions at the genome level. Building taxonomic classifier databases often requires accessions at the sequence level, along with the sequence accession to TAXID file.

This adds a script to do so. Using the names.dmp file created with gtdb_to_taxdump.py, it goes through all GTDB genomes, retrieves each sequence accession number, and associates it with the corresponding TAXID through the genome accession.
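
The mapping described above could be sketched roughly as follows. This is a minimal illustration, not the script from this PR. It assumes that `names.dmp` follows the NCBI `tax_id | name_txt | unique name | name class |` layout, that each genome accession appears in it as a name, and that sequence accessions are the first token of each FASTA header. The function names, the example accessions, and the placeholder `0` in the `gi` column are all hypothetical.

```python
import io


def read_names_dmp(handle):
    """Map each name in an NCBI-style names.dmp to its taxid."""
    name_to_taxid = {}
    for line in handle:
        # Fields are separated by "\t|\t"; splitting on "|" and stripping
        # whitespace tolerates the trailing "\t|" terminator.
        fields = [f.strip() for f in line.rstrip("\n").split("|")]
        if len(fields) < 2 or not fields[0]:
            continue
        name_to_taxid[fields[1]] = int(fields[0])
    return name_to_taxid


def fasta_accessions(handle):
    """Yield the first whitespace-delimited token of every FASTA header."""
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                yield parts[0]


def accession2taxid_rows(genome_acc, fasta_handle, name_to_taxid):
    """Yield (accession, accession.version, taxid, gi) for one genome."""
    # Assumption: the genome accession is itself a name in names.dmp.
    taxid = name_to_taxid[genome_acc]
    for acc_ver in fasta_accessions(fasta_handle):
        acc = acc_ver.split(".")[0]  # drop the version suffix
        yield (acc, acc_ver, taxid, 0)  # gi is a placeholder here


# Made-up example data, held in memory instead of real files.
names = io.StringIO(
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "12\t|\tRS_GCF_000001.1\t|\t\t|\tscientific name\t|\n"
)
fasta = io.StringIO(
    ">NZ_CP000001.1 chromosome\nACGT\n"
    ">NZ_CP000002.1 plasmid\nGG\n"
)
rows = list(accession2taxid_rows("RS_GCF_000001.1", fasta, read_names_dmp(names)))
for row in rows:
    print("\t".join(str(col) for col in row))
```

Every sequence in a genome inherits that genome's TAXID, so the lookup happens once per genome rather than once per sequence.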

maxibor commented 2 years ago

And one git rebase later, pytest is now happy. This is ready for you to review, @nick-youngblut.

nick-youngblut commented 2 years ago

Thanks @maxibor for the additional contribution! The script looks quite useful.