nick-youngblut / gtdb_to_taxdump

Convert GTDB taxonomy to NCBI taxdump format
MIT License

Error in ncbi-gtdb_map.py using NCBI taxids #9

Closed dportik closed 2 years ago

dportik commented 2 years ago

Thanks for making these useful tools! I have been looking for a quick way to compare NCBI names to GTDB names and ncbi-gtdb_map.py is great for this use-case.

I first tried converting species names from NCBI to GTDB, and it ran successfully. I noticed quite a few NCBI species were not assigned a GTDB name.

I am now trying to use the NCBI taxids to see if there is any difference. However, it looks like I've hit a bug when invoking the --names-dmp and --nodes-dmp flags. I've run:

ncbi-gtdb_map.py --names-dmp /taxdump/names.dmp --nodes-dmp /taxdump/nodes.dmp -o /Full-output NCBI-codes.txt https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/ar122_metadata_r95.tar.gz https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/bac120_metadata_r95.tar.gz 

The error is pasted below:

2022-01-11 15:01:20,595 - Loading file: /taxdump/names.dmp
Traceback (most recent call last):
  File "/usr/local/bin/ncbi-gtdb_map.py", line 628, in <module>
    main(args)
  File "/usr/local/bin/ncbi-gtdb_map.py", line 599, in main
    ncbi_tax = gtdb2td.Dmp.load_dmp(args.names_dmp, args.nodes_dmp)
  File "/usr/local/lib/python3.7/site-packages/gtdb2td/Dmp.py", line 70, in load_dmp
    with gtdb2td.Utils.Open(names_dmp_file) as inF:
NameError: name 'gtdb2td' is not defined

Any idea what might be happening here?

Thanks!

nick-youngblut commented 2 years ago

Did you actually install the package via pip, or are you just running the script from the ./bin/ directory? The error suggests that you didn't install the package ("gtdb2td"), and so gtdb2td cannot be found.

dportik commented 2 years ago

Yes, it was installed with pip.

$ pip3 show gtdb_to_taxdump
Name: gtdb-to-taxdump
Version: 0.1.7
Summary: GTDB database utility scripts
Home-page: https://github.com/nick-youngblut/gtdb_to_taxdump
Author: Nick Youngblut
Author-email: nyoungb2@gmail.com
License: MIT license
Location: /usr/local/lib/python3.7/site-packages
Requires: networkx
Required-by: 

I also tried uninstalling the pip version and installing locally with setup.py instead, and got the same error:

2022-01-12 10:42:02,052 - Loading file: /Users/dportik/Documents/Projects/Proj-Zymo-TruMatrix/3-MAGs/GTDB-to-NCBI/taxdump/names.dmp
Traceback (most recent call last):
  File "/usr/local/bin/ncbi-gtdb_map.py", line 4, in <module>
    __import__('pkg_resources').run_script('gtdb-to-taxdump==0.1.7', 'ncbi-gtdb_map.py')
  File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1471, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python3.7/site-packages/gtdb_to_taxdump-0.1.7-py3.7.egg/EGG-INFO/scripts/ncbi-gtdb_map.py", line 628, in <module>
  File "/usr/local/lib/python3.7/site-packages/gtdb_to_taxdump-0.1.7-py3.7.egg/EGG-INFO/scripts/ncbi-gtdb_map.py", line 599, in main
  File "/usr/local/lib/python3.7/site-packages/gtdb_to_taxdump-0.1.7-py3.7.egg/gtdb2td/Dmp.py", line 70, in load_dmp
NameError: name 'gtdb2td' is not defined

Looks like the issue is in Dmp.py.
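For anyone reading along, this kind of NameError usually means a module uses a name it never imported itself. Importing the package in the calling script does not make the name visible inside another module. The sketch below reproduces that failure mode with the standard-library `fnmatch` module as a stand-in. The function names are hypothetical, and this is not the actual code in Dmp.py.

```python
def is_dmp_without_import(path):
    # 'fnmatch' is never imported in this scope, so Python raises
    # NameError when the function runs (not when it is defined).
    return fnmatch.fnmatch(path, "*.dmp")


def is_dmp_with_import(path):
    # Importing the module where it is used resolves the name.
    import fnmatch
    return fnmatch.fnmatch(path, "*.dmp")


try:
    is_dmp_without_import("names.dmp")
    error_message = None
except NameError as err:
    error_message = str(err)

is_dmp = is_dmp_with_import("names.dmp")

print(error_message)
print(is_dmp)
```

This also explains why the error only shows up with --names-dmp and --nodes-dmp: the lookup fails only when the affected function is actually called.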

nick-youngblut commented 2 years ago

Do any of the other scripts work?

dportik commented 2 years ago

This script works as long as I do not use the --names-dmp and --nodes-dmp flags.

Adding a simple import statement in Dmp.py and re-installing locally fixed that issue, but I hit another error soon after:

2022-01-12 10:50:20,479 - Loading file: /Users/dportik/Documents/Projects/Proj-Zymo-TruMatrix/3-MAGs/GTDB-to-NCBI/taxdump/names.dmp
Traceback (most recent call last):
  File "/usr/local/bin/ncbi-gtdb_map.py", line 4, in <module>
    __import__('pkg_resources').run_script('gtdb-to-taxdump==0.1.7', 'ncbi-gtdb_map.py')
  File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1471, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python3.7/site-packages/gtdb_to_taxdump-0.1.7-py3.7.egg/EGG-INFO/scripts/ncbi-gtdb_map.py", line 628, in <module>
  File "/usr/local/lib/python3.7/site-packages/gtdb_to_taxdump-0.1.7-py3.7.egg/EGG-INFO/scripts/ncbi-gtdb_map.py", line 599, in main
  File "/usr/local/lib/python3.7/site-packages/gtdb_to_taxdump-0.1.7-py3.7.egg/gtdb2td/Dmp.py", line 76, in load_dmp
TypeError: cannot use a string pattern on a bytes-like object

After a bit of searching, I found that I had to add .decode('utf-8') to every line in Dmp.py that gets split with a regex. With that change, the script finished successfully. I'll open a pull request so you can see the relevant changes.
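As a rough illustration of the second error: Python's `re` module refuses to apply a `str` pattern to a `bytes` object, which is what you get when a file is read in binary mode. The sketch below assumes a tab-and-pipe delimited line in the style of names.dmp. The sample line and the split pattern are assumptions for illustration, not the code from Dmp.py.

```python
import re

# A names.dmp-style record as it would arrive from a binary-mode read.
raw_line = b"1\t|\troot\t|\t\t|\tscientific name\t|\n"

# Hypothetical str pattern for the field separator (tab, pipe, tab).
pattern = r"\t\|\t"

try:
    # str pattern + bytes input raises TypeError.
    re.split(pattern, raw_line)
    error_message = None
except TypeError as err:
    error_message = str(err)

# Decoding to str first lets the same pattern work.
fields = re.split(pattern, raw_line.decode("utf-8").rstrip("\n"))

print(error_message)
print(fields[:2])
```

An alternative with the same effect would be to open the file in text mode, or to use a bytes pattern such as `rb"\t\|\t"`. Decoding each line is simply the smallest change to existing code.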