nick-youngblut / gtdb_to_taxdump

Convert GTDB taxonomy to NCBI taxdump format
MIT License

KeyError: 'Cannot find GCA003697015.1 accession in names.dmp' #23

Open bheimbu opened 1 year ago

bheimbu commented 1 year ago

Hi @nick-youngblut,

when I try to build the gtdb database using r207, I get:

gtdb_to_diamond.py -o gtdb gtdb_proteins_aa_reps_r207.tar.gz taxdump/names.dmp taxdump/nodes.dmp
2023-08-23 13:53:35,547 - Read nodes.dmp file: taxdump/nodes.dmp
2023-08-23 13:53:35,813 - File written: gtdb/nodes.dmp
2023-08-23 13:53:35,813 - Reading dumpfile: taxdump/names.dmp
2023-08-23 13:53:37,103 -   File written: gtdb/names.dmp
2023-08-23 13:53:37,103 -   No. of accession<=>taxID pairs: 398700
2023-08-23 13:53:37,104 - Extracting tarball: gtdb_proteins_aa_reps_r207.tar.gz
2023-08-23 13:53:37,104 -   Extracting to: gtdb_to_diamond_TMP
2023-08-23 14:13:05,126 -   No. of .faa(.gz) files: 65703
2023-08-23 14:13:05,150 - Creating accession2taxid table...
Traceback (most recent call last):
  File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 79, in accession2taxid
    taxID = names_dmp[accession]
KeyError: 'GCA003697015.1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 146, in <module>
    main(args)
  File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 135, in main
    accession2taxid(names_dmp, faa_files, args.outdir)
  File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 82, in accession2taxid
    raise KeyError(msg.format(accession))
KeyError: 'Cannot find GCA003697015.1 accession in names.dmp'

I've downloaded names.dmp and nodes.dmp from here and gtdb_proteins_aa_reps_r207.tar.gz from here.

Any help is highly appreciated,

Bastian

nick-youngblut commented 1 year ago

I've changed the KeyError to a warning, which should show whether many accessions fail to overlap between the tarball and the dmp files, or whether it is just GCA003697015.1. Run the command again and see how many warnings you get.
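The overlap check described above can also be done on its own, outside the script. This is a minimal sketch, not the code in gtdb_to_diamond.py: `split_by_presence` and the toy `acc2taxid` mapping are hypothetical names and data, and the real mapping would have to be built from names.dmp.

```python
def split_by_presence(accessions, acc2taxid):
    """Partition accessions into those present in and absent from the mapping."""
    found, missing = [], []
    for acc in accessions:
        (found if acc in acc2taxid else missing).append(acc)
    return found, missing

# Hypothetical toy mapping; in practice this would be parsed from names.dmp
acc2taxid = {"GCA_000000001.1": 10, "GCF_000000002.1": 11}

# Accessions as they would be derived from the .faa file names (assumed here)
queries = ["GCA_000000001.1", "GCA003697015.1"]

found, missing = split_by_presence(queries, acc2taxid)
print(f"{len(found)} found, {len(missing)} missing")
print("First missing accessions:", missing[:5])
```

If nearly every accession ends up in `missing`, the cause is more likely a systematic mismatch (naming format or release version) than a single absent genome.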

bheimbu commented 1 year ago

After updating gtdb_to_diamond.py, I get the following error:

gtdb_to_diamond.py -o gtdb_vers207 gtdb_proteins_aa_reps_r207.tar.gz taxdump/names.dmp taxdump/nodes.dmp
2023-08-24 09:28:58,061 - Read nodes.dmp file: taxdump/nodes.dmp
2023-08-24 09:28:58,616 - File written: gtdb_vers207/nodes.dmp
2023-08-24 09:28:58,616 - Reading dumpfile: taxdump/names.dmp
2023-08-24 09:29:01,492 -   File written: gtdb_vers207/names.dmp
2023-08-24 09:29:01,492 -   No. of accession<=>taxID pairs: 398700
2023-08-24 09:29:01,493 - Extracting tarball: gtdb_proteins_aa_reps_r207.tar.gz
2023-08-24 09:29:01,493 -   Extracting to: gtdb_to_diamond_TMP
2023-08-24 10:06:43,630 -   No. of .faa(.gz) files: 65703
2023-08-24 10:06:43,675 - Creating accession2taxid table...
2023-08-24 10:06:43,676 - WARNING: Cannot find GCA003697015.1 accession in names.dmp
Traceback (most recent call last):
  File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 146, in <module>
    main(args)
  File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 135, in main
    accession2taxid(names_dmp, faa_files, args.outdir)
  File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 84, in accession2taxid
    line = [acc_base, accession, str(taxID), '']
UnboundLocalError: local variable 'taxID' referenced before assignment

Cheers,
Bastian
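The UnboundLocalError in the traceback is what Python raises when the lookup fails, only a warning is logged, and the code then goes on to use `taxID`, which was never assigned. The following is a rough sketch of one way to avoid that, not the actual code in gtdb_to_diamond.py: the helper names, the "Not found" placeholder, and the way `acc_base` is derived are all assumptions.

```python
import logging

def lookup_taxid(acc2taxid, accession, default="Not found"):
    """Return the taxID for an accession, or a placeholder if it is missing."""
    try:
        return acc2taxid[accession]
    except KeyError:
        logging.warning("Cannot find %s accession in names.dmp", accession)
        # Returning a value here is what prevents the unbound variable
        return default

def build_rows(acc2taxid, accessions):
    """Build accession2taxid-style rows: [base, accession, taxID, ''] (layout taken from the traceback)."""
    rows = []
    for accession in accessions:
        taxid = lookup_taxid(acc2taxid, accession)
        acc_base = accession.split(".")[0]  # assumption: base = accession without its version suffix
        rows.append([acc_base, accession, str(taxid), ""])
    return rows

# Hypothetical toy data
rows = build_rows({"GCA_000000001.1": 10}, ["GCA_000000001.1", "GCA003697015.1"])
for row in rows:
    print("\t".join(row))
```

An alternative is to `continue` in the loop and skip accessions that are missing, so that no placeholder rows reach the table.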

bheimbu commented 1 year ago

I've changed your code (here is the adjusted Python script) and it now runs. However, every accession number in accession2taxid.tsv is assigned to Not found; that is, gtdb_to_diamond.py gives me a warning for every accession number, e.g. Cannot find GCA001315985.1 accession in names.dmp. So there must be something wrong with the nodes.dmp file, right?
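Since the warning text names names.dmp, one way to narrow this down is to check how accessions are actually written in that file and compare them with the form the script looks up. The sketch below assumes the standard NCBI names.dmp layout (fields separated by `|` with tab padding, tax_id first, name second). The helper names and the toy lines are hypothetical, and the `GB_GCA_...` spelling is only an illustration of a possible mismatch, not a statement about the real file.

```python
import re

def parse_names_dmp(lines):
    """Map name_txt -> tax_id from NCBI-style names.dmp lines."""
    name2taxid = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split("|")]
        if len(fields) < 2 or not fields[0]:
            continue
        name2taxid[fields[1]] = fields[0]
    return name2taxid

def digits_key(name):
    """Reduce an accession-like string to its 9-digit number plus version."""
    m = re.search(r"\d{9}\.\d+", name)
    return m.group(0) if m else None

def near_matches(query, name2taxid):
    """Names in the mapping that share the query's numeric part and version."""
    key = digits_key(query)
    if key is None:
        return []
    return [n for n in name2taxid if digits_key(n) == key]

# Hypothetical toy lines; replace with open("taxdump/names.dmp") in practice
toy_lines = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "2\t|\tGB_GCA_003697015.1\t|\t\t|\tscientific name\t|\n",
]
name2taxid = parse_names_dmp(toy_lines)
print(near_matches("GCA003697015.1", name2taxid))
```

If `near_matches` returns hits for accessions that the exact lookup misses, the two sides differ only in prefix or punctuation. If it returns nothing, the accessions are absent from names.dmp altogether, which could point to a release mismatch between the taxdump and the protein tarball.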