nick-youngblut / gtdb_to_taxdump

Convert GTDB taxonomy to NCBI taxdump format
MIT License

"gtdb_to_diamond.py" generates an empty "gtdb_all.faa.gz" file #3

Closed Jigyasa3 closed 3 years ago

Jigyasa3 commented 3 years ago

Hey @nick-youngblut and @shenwei356

I ran the scripts to generate the GTDB database and accession files for a DIAMOND BLAST search, but they generate an empty "gtdb_all.faa.gz" file.

Code used-

to generate the nodes.dmp and names.dmp file-

$ python /home/j/jigyasa-arora/local/gtdb_to_taxdump/gtdb_to_taxdump.py https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/ar122_taxonomy_r95.tsv https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/bac120_taxonomy_r95.tsv > taxID_info.tsv

to generate the database-

$ python /home/j/jigyasa-arora/local/gtdb_to_taxdump/gtdb_to_diamond.py -g gtdb_proteins_aa_reps_r95.tar.gz /home/j/jigyasa-arora/local/gtdb_to_taxdump/names.dmp /home/j/jigyasa-arora/local/gtdb_to_taxdump/nodes.dmp

Output-

2020-12-31 14:01:34,087 - Read nodes.dmp file: /home/j/jigyasa-arora/local/gtdb_to_taxdump/nodes.dmp
2020-12-31 14:01:34,168 - File written: gtdb_to_diamond/nodes.dmp
2020-12-31 14:01:34,169 - Reading dumpfile: /home/j/jigyasa-arora/local/gtdb_to_taxdump/names.dmp
2020-12-31 14:01:34,807 - File written: gtdb_to_diamond/names.dmp
2020-12-31 14:01:34,807 - No. of accession<=>taxID pairs: 237629
2020-12-31 14:01:34,807 - Extracting tarball: gtdb_proteins_aa_reps_r95.tar.gz
2020-12-31 14:08:05,372 - No. of .faa.gz files: 0
2020-12-31 14:08:05,384 - Creating accession2taxid table...
2020-12-31 14:08:05,385 - File written: gtdb_to_diamond/accession2taxid.tsv
2020-12-31 14:08:05,385 - Formating & merging faa files...
2020-12-31 14:08:05,386 - File written: gtdb_to_diamond/gtdb_all.faa.gz
2020-12-31 14:08:05,386 - Temp-dir removed: gtdb_to_diamond_TMP

In the end, it creates a folder "gtdb_to_diamond", but all the files in it are empty. I am using Python 3.7.3 and GTDB release 95.

Any suggestions?

Regards,
Jigyasa

nick-youngblut commented 3 years ago

The issue was that the amino acid sequence files in GTDB-r95 are not gzip'ed, so gtdb_to_diamond.py wasn't finding them (as shown in your log). I've updated the script so that it should be compatible with all GTDB releases.
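To illustrate the kind of fix described above, here is a minimal sketch of collecting protein files whether or not they are gzip-compressed. It is not the actual code in gtdb_to_diamond.py; the function names `find_faa_files` and `open_maybe_gzip` are hypothetical and introduced only for this example.

```python
import gzip
from pathlib import Path


def find_faa_files(root):
    """Return sorted paths under `root` ending in .faa or .faa.gz.

    Hypothetical helper: matching both suffixes avoids the
    'No. of .faa.gz files: 0' situation when a release ships
    uncompressed .faa files.
    """
    suffixes = ('.faa', '.faa.gz')
    return sorted(
        p for p in Path(root).rglob('*')
        if p.is_file() and p.name.endswith(suffixes)
    )


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed file for reading as text."""
    path = str(path)
    if path.endswith('.gz'):
        return gzip.open(path, 'rt')  # 'rt' yields str, not bytes
    return open(path, 'r')
```

With helpers like these, a loop over `find_faa_files(extracted_dir)` could read every sequence file through `open_maybe_gzip` without caring how a given release was packaged.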

Jigyasa3 commented 3 years ago

Hey @nick-youngblut

Thanks for updating the script! There was another error.

Error- TypeError: a bytes-like object is required, not 'str'

Which was resolved by updating line 180 of gtdb_to_diamond.py

Before- with _open(outfile, 'w') as outF:
After- with _open(outfile, 'wb') as outF:

Thanks again for the help!

nick-youngblut commented 3 years ago

It was an encoding error when using the --gzip option. It should now be fixed. Note that gzip'ing the output will require a lot more time than writing the uncompressed sequences.
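For context, in Python 3 `gzip.open` defaults to binary mode, so writing a `str` to a handle opened with mode 'w' raises a TypeError of this kind. The sketch below shows one way to avoid it by opening the gzip handle in text mode ('wt'). It is an illustration under assumptions, not the script's actual fix; `open_out` and `write_fasta` are hypothetical names.

```python
import gzip


def open_out(path, use_gzip=False):
    """Open `path` for writing text, gzip-compressed if requested."""
    if use_gzip:
        # 'wt' wraps the gzip stream in a text layer, so str can be written
        return gzip.open(path, 'wt')
    return open(path, 'w')


def write_fasta(records, path, use_gzip=False):
    """Write (header, sequence) pairs in FASTA format."""
    with open_out(path, use_gzip) as out:
        for header, seq in records:
            out.write(f'>{header}\n{seq}\n')
```

Calling `write_fasta(records, 'out.faa.gz', use_gzip=True)` takes the same code path as the uncompressed case, only with a slower, compressed handle.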