Closed: Jigyasa3 closed this issue 3 years ago
The issue was that the amino acid sequence files in GTDB-r95 are not gzip'ed, so gtdb_to_diamond.py wasn't finding them (as shown in your log). I've updated the script so that it should be compatible with all GTDB releases.
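For anyone curious how a script can cope with both layouts, here is a minimal sketch. It assumes the only difference between releases is whether the per-genome files end in .faa or .faa.gz. The helper names find_faa_files and open_fasta are made up for illustration and are not taken from gtdb_to_diamond.py.

```python
import gzip
from pathlib import Path


def find_faa_files(root):
    """Collect both plain (*.faa) and gzip'ed (*.faa.gz) files below root."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith((".faa", ".faa.gz"))
    )


def open_fasta(path):
    """Open a FASTA file for reading as text, whether or not it is gzip'ed."""
    path = Path(path)
    if path.suffix == ".gz":
        # 'rt' makes gzip.open return str lines instead of bytes
        return gzip.open(path, "rt")
    return open(path, "r")
```

Matching on both suffixes means the same code path handles a release with compressed files and one without, rather than silently finding zero files.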
Hey @nick-youngblut
Thanks for updating the script! I then hit another error:
TypeError: a bytes-like object is required, not 'str'
I resolved it by changing line 180 of gtdb_to_diamond.py.
Before:
with _open(outfile, 'w') as outF:
After:
with _open(outfile, 'wb') as outF:
Thanks again for the help!
It was an encoding error when using the --gzip option. It should now be fixed. Note that gzip'ing the output will take much longer than writing the uncompressed sequences.
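For reference, this kind of TypeError is what Python's gzip module raises when a str is written to a file opened in binary mode. The sketch below is not the code from gtdb_to_diamond.py; it only illustrates the general behaviour with a made-up FASTA record and shows two common ways around it, encoding to bytes or opening in text mode ('wt').

```python
import gzip
import os
import tempfile

record = ">seq1\nMKV\n"  # made-up FASTA record for illustration

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.faa.gz")

    # gzip.open in binary mode only accepts bytes-like objects,
    # so writing a str raises TypeError.
    try:
        with gzip.open(path, "wb") as out:
            out.write(record)
        binary_str_error = None
    except TypeError as err:
        binary_str_error = str(err)

    # Option 1: stay in binary mode and encode the text first.
    with gzip.open(path, "wb") as out:
        out.write(record.encode("utf-8"))

    # Option 2: open in text mode so str can be written directly.
    with gzip.open(path, "wt") as out:
        out.write(record)

    # Read the file back as text to confirm the round trip.
    with gzip.open(path, "rt") as inp:
        round_trip = inp.read()
```

Note that gzip.open treats a bare 'w' as binary, unlike the built-in open, which is an easy way to end up with this error when one code path handles both compressed and uncompressed output.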
Hey @nick-youngblut and @shenwei356
I ran the scripts to generate the GTDB database and accession files for a DIAMOND BLAST search, but the result is an empty "gtdb_all.faa.gz" file.
Code used:
To generate the nodes.dmp and names.dmp files:
$ python /home/j/jigyasa-arora/local/gtdb_to_taxdump/gtdb_to_taxdump.py https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/ar122_taxonomy_r95.tsv https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/bac120_taxonomy_r95.tsv > taxID_info.tsv
To generate the database:
$ python /home/j/jigyasa-arora/local/gtdb_to_taxdump/gtdb_to_diamond.py -g gtdb_proteins_aa_reps_r95.tar.gz /home/j/jigyasa-arora/local/gtdb_to_taxdump/names.dmp /home/j/jigyasa-arora/local/gtdb_to_taxdump/nodes.dmp
Output:
2020-12-31 14:01:34,087 - Read nodes.dmp file: /home/j/jigyasa-arora/local/gtdb_to_taxdump/nodes.dmp
2020-12-31 14:01:34,168 - File written: gtdb_to_diamond/nodes.dmp
2020-12-31 14:01:34,169 - Reading dumpfile: /home/j/jigyasa-arora/local/gtdb_to_taxdump/names.dmp
2020-12-31 14:01:34,807 - File written: gtdb_to_diamond/names.dmp
2020-12-31 14:01:34,807 - No. of accession<=>taxID pairs: 237629
2020-12-31 14:01:34,807 - Extracting tarball: gtdb_proteins_aa_reps_r95.tar.gz
2020-12-31 14:08:05,372 - No. of .faa.gz files: 0
2020-12-31 14:08:05,384 - Creating accession2taxid table...
2020-12-31 14:08:05,385 - File written: gtdb_to_diamond/accession2taxid.tsv
2020-12-31 14:08:05,385 - Formating & merging faa files...
2020-12-31 14:08:05,386 - File written: gtdb_to_diamond/gtdb_all.faa.gz
2020-12-31 14:08:05,386 - Temp-dir removed: gtdb_to_diamond_TMP
In the end it creates a folder "gtdb_to_diamond", but all the files in it are empty. I am using Python 3.7.3 and GTDB release 95.
Any suggestions?
Regards,
Jigyasa
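The "No. of .faa.gz files: 0" line in the log suggests the tarball members do not have the suffix the script was looking for. A quick way to check this without extracting anything is to count member names by suffix. This is only a sketch using the standard tarfile module; count_member_suffixes is a made-up helper, not part of gtdb_to_taxdump.

```python
import tarfile
from collections import Counter


def count_member_suffixes(tarball, suffixes=(".faa.gz", ".faa")):
    """Count regular files in a tarball whose names end with each suffix."""
    counts = Counter()
    # 'r:*' lets tarfile detect the compression (gz, bz2, xz, or none)
    with tarfile.open(tarball, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            for suffix in suffixes:
                if member.name.endswith(suffix):
                    counts[suffix] += 1
                    break
    return counts
```

Calling count_member_suffixes("gtdb_proteins_aa_reps_r95.tar.gz") would show whether the release ships plain .faa files or .faa.gz files. Scanning a large compressed tarball still takes a while, since the whole archive has to be read.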