Open bheimbu opened 1 year ago
I've changed the KeyError to a warning, which should show whether many accessions fail to overlap between the tarball and the dmp files, or whether it is just GCA003697015.1. Run the command again and see how many warnings you get.
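One way to count the warnings rather than reading them off the terminal is sketched below. It assumes the log lines keep the "WARNING: Cannot find <accession> accession in names.dmp" wording seen in this thread. The function name missing_accessions and the sample lines are illustrative only and are not part of gtdb_to_diamond.py.

```python
import re
from collections import Counter

# Pattern based on the warning wording shown in this thread (an assumption
# about the log format, not something taken from the script's source).
WARNING_RE = re.compile(r"WARNING: Cannot find (\S+) accession in names\.dmp")


def missing_accessions(log_lines):
    """Count how often each accession is reported as missing."""
    counts = Counter()
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative input: two lines copied from the log in this thread.
sample_log = [
    "2023-08-24 10:06:43,675 - Creating accession2taxid table...",
    "2023-08-24 10:06:43,676 - WARNING: Cannot find GCA003697015.1 accession in names.dmp",
]
missing = missing_accessions(sample_log)
print(len(missing), "distinct accession(s) reported missing")
```

In practice the lines would come from the captured output of the run (for example a file that stderr was redirected to) instead of sample_log.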
After updating gtdb_to_diamond.py, I get the following error:
gtdb_to_diamond.py -o gtdb_vers207 gtdb_proteins_aa_reps_r207.tar.gz taxdump/names.dmp taxdump/nodes.dmp
2023-08-24 09:28:58,061 - Read nodes.dmp file: taxdump/nodes.dmp
2023-08-24 09:28:58,616 - File written: gtdb_vers207/nodes.dmp
2023-08-24 09:28:58,616 - Reading dumpfile: taxdump/names.dmp
2023-08-24 09:29:01,492 - File written: gtdb_vers207/names.dmp
2023-08-24 09:29:01,492 - No. of accession<=>taxID pairs: 398700
2023-08-24 09:29:01,493 - Extracting tarball: gtdb_proteins_aa_reps_r207.tar.gz
2023-08-24 09:29:01,493 - Extracting to: gtdb_to_diamond_TMP
2023-08-24 10:06:43,630 - No. of .faa(.gz) files: 65703
2023-08-24 10:06:43,675 - Creating accession2taxid table...
2023-08-24 10:06:43,676 - WARNING: Cannot find GCA003697015.1 accession in names.dmp
Traceback (most recent call last):
File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 146, in <module>
main(args)
File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 135, in main
accession2taxid(names_dmp, faa_files, args.outdir)
File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 84, in accession2taxid
line = [acc_base, accession, str(taxID), '']
UnboundLocalError: local variable 'taxID' referenced before assignment
Cheers, Bastian
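The traceback is consistent with a lookup that now logs a warning on failure but still falls through to the line that uses taxID, so the variable is never assigned for the first missing accession. The sketch below shows one possible shape of a fix: skip the row when the lookup fails. It is a hypothetical stand-in, not the real code of gtdb_to_diamond.py. The function name build_rows, the dictionary argument, and the way acc_base is derived are all assumptions; only the row layout comes from the traceback.

```python
import logging


def build_rows(accessions, acc2taxid):
    """Hypothetical stand-in for the row-building loop in accession2taxid()."""
    rows = []
    for accession in accessions:
        # Assumption: the base accession is the part before the version suffix.
        acc_base = accession.split(".")[0]
        taxid = acc2taxid.get(accession)
        if taxid is None:
            logging.warning("Cannot find %s accession in names.dmp", accession)
            continue  # skip the row so taxid is never used while unset
        # Row layout as shown in the traceback: base, accession, taxID, empty field.
        rows.append([acc_base, accession, str(taxid), ""])
    return rows


# Illustrative data: one accession with a taxID, one without.
rows = build_rows(
    ["GCA000000001.1", "GCA003697015.1"],
    {"GCA000000001.1": 42},
)
```

Skipping is only one option. Writing a placeholder taxID would also avoid the error, but it would put unusable rows into accession2taxid.tsv.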
I've changed your code (here is the adjusted Python script) and it now runs. However, all accession numbers in accession2taxid.tsv are assigned to Not found; that is, gtdb_to_diamond.py reports for every accession number, e.g., Cannot find GCA001315985.1 accession in names.dmp. So there must be something wrong with the nodes.dmp file, right?
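When every accession is reported as missing, one possible explanation (an assumption, not something confirmed in this thread) is that the accessions are written differently in the two sources, for example with or without a prefix or underscores. A minimal sketch for checking the overlap after normalizing both sides follows. The normalization rules, the function names, and the sample accessions are illustrative and should be adapted to what the actual files contain.

```python
def normalize(accession):
    """Assumed normalization: drop an optional RS_/GB_ prefix and all underscores."""
    for prefix in ("RS_", "GB_"):
        if accession.startswith(prefix):
            accession = accession[len(prefix):]
    return accession.replace("_", "")


def overlap_report(names_accessions, tarball_accessions):
    """Compare the two accession collections after normalization."""
    names = {normalize(a) for a in names_accessions}
    tarball = {normalize(a) for a in tarball_accessions}
    return {
        "shared": len(names & tarball),
        "only_in_names": len(names - tarball),
        "only_in_tarball": len(tarball - names),
    }


# Illustrative values only; real input would be parsed from names.dmp
# and from the .faa file names in the extracted tarball.
report = overlap_report(
    ["RS_GCF_000000001.1", "GB_GCA_001315985.1"],
    ["GCA001315985.1", "GCA003697015.1"],
)
print(report)
```

A large shared count after normalization, combined with zero matches before it, would point to a formatting mismatch rather than to accessions that are genuinely absent from the dmp files.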
Hi @nick-youngblut,
when I try to build the GTDB database using r207, I get:
I've downloaded names.dmp and nodes.dmp from here and gtdb_proteins_aa_reps_r207.tar.gz from here.
Any help is highly appreciated,
Bastian