nick-youngblut / gtdb_to_taxdump

Convert GTDB taxonomy to NCBI taxdump format
MIT License

Fix bug in `acc2gtdb_tax.py` #18

Closed Sidduppal closed 2 years ago

Sidduppal commented 2 years ago

Fixes issue #17. The genomes in the database directory in gtdbtk_<versionNumber>_data.tar.gz, as well as in the representative genome directory (gtdb_genomes_reps_r207.tar.gz), are stored under paths of the form /database/GCA/001/508/855/<genome>.fna.gz. The seq_acc2tax function in acc2gtdb_tax.py groups genomes into GCA or GCF by taking that string from the genome path. Currently, however, it splits the path on / and takes the fourth element from the end. This is incorrect: it returns the numeric directory next to GCA or GCF (001 in the example above) rather than GCA or GCF itself. This PR fixes that by taking the fifth element from the end after splitting.
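
To make the off-by-one concrete, here is a minimal sketch of the indexing on a path with the layout described above. The genome file name is a made-up placeholder, and the snippet only illustrates the split; it is not the code from acc2gtdb_tax.py.

```python
# Illustrative path following the layout described above; the genome
# file name is a placeholder, not a real file.
path = "/database/GCA/001/508/855/example_genome.fna.gz"
splitpath = path.split("/")

# Counting from the end: file name, 855, 508, 001, GCA
fourth_from_end = splitpath[-4]  # numeric directory
fifth_from_end = splitpath[-5]   # accession prefix directory

print(fourth_from_end, fifth_from_end)
```

With this layout, index -4 lands on the first numeric directory and index -5 on the GCA/GCF directory.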

nick-youngblut commented 2 years ago

It seems like the edit:

acc_prefix = acc_code[splitpath[-5]]

could possibly break the usage.

What do you think @maxibor?
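
One possible way to avoid depending on a fixed position, sketched here only as an idea and not as what the PR implements: scan the directory components for GCA or GCF. The helper name find_acc_prefix is hypothetical and does not exist in the repo.

```python
from pathlib import PurePosixPath


def find_acc_prefix(path):
    """Return 'GCA' or 'GCF' if it appears as a directory in path, else None.

    Hypothetical helper: searches the directory components from the end
    so the result does not depend on how deeply the genome is nested.
    """
    # Drop the last component (the file name) and walk the directories
    # from the innermost outwards.
    for part in reversed(PurePosixPath(path).parts[:-1]):
        if part in ("GCA", "GCF"):
            return part
    return None


# Placeholder paths for illustration only.
print(find_acc_prefix("/database/GCA/001/508/855/example_genome.fna.gz"))
print(find_acc_prefix("GCF/000/001/example_genome.fna.gz"))
```

This returns None when neither directory is present, so a caller could decide how to handle unexpected layouts rather than silently picking the wrong component.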

maxibor commented 2 years ago

Hey @nick-youngblut @Sidduppal, sorry, I haven't had much time to look at this, and won't have time until next week :(

nick-youngblut commented 2 years ago

@maxibor I've added a Gitpod setup to this repo, if you want to use it to help with this PR.