Closed Sidduppal closed 2 years ago
It seems like the edit acc_prefix = acc_code[splitpath[-5]] could possibly break the usage.
What do you think @maxibor?
Hey @nick-youngblut @Sidduppal, sorry, I haven't had much time to look at this, and I won't have time until next week :(
@maxibor I've added a Gitpod setup to this repo, if you want to use it to help with this PR.
Fixes issue #17. The genomes in the database directory in gtdbtk_<versionNumber>_data.tar.gz, as well as in the representative genome directory (gtdb_genomes_reps_r207.tar.gz), are stored in the form /database/GCA/001/508/855/<genome>.fna.gz. The seq_acc2tax function in acc2gtdb_tax.py groups genomes into GCA or GCF by grabbing that string from the genome path. However, it currently splits the path on / and then takes the fourth element from the end (link). This is incorrect: the fourth element from the end is the number directly after GCA or GCF (001 in the path above), not GCA or GCF itself. This PR fixes that by taking the fifth element from the end after splitting.
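The indexing difference can be checked with a minimal sketch. The file name genome.fna.gz is a placeholder for <genome>.fna.gz, and find_prefix is a hypothetical helper that is not part of acc2gtdb_tax.py; it only illustrates one way to pick the prefix without depending on directory depth.

```python
# Placeholder path following the layout described in this PR;
# "genome.fna.gz" stands in for <genome>.fna.gz.
path = "/database/GCA/001/508/855/genome.fna.gz"

splitpath = path.split("/")
# splitpath is ['', 'database', 'GCA', '001', '508', '855', 'genome.fna.gz']

old_pick = splitpath[-4]  # fourth from the end: the first numeric directory
new_pick = splitpath[-5]  # fifth from the end: the GCA/GCF directory


def find_prefix(genome_path):
    """Hypothetical alternative: return the first GCA/GCF path component.

    This does not rely on a fixed directory depth, so it would still work
    if the path had extra leading directories.
    """
    for part in genome_path.split("/"):
        if part in ("GCA", "GCF"):
            return part
    raise ValueError(f"no GCA/GCF component in path: {genome_path}")


print(old_pick, new_pick, find_prefix(path))
```

With this layout, index -4 yields 001 while index -5 yields GCA, which matches the change proposed here. Whether a depth-independent lookup like find_prefix is preferable depends on how the surrounding code uses the path, so treat it as an option to discuss rather than a drop-in replacement.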