mattb112885 / clusterDbAnalysis

ITEP - Integrated Toolkit for Exploration of microbial Pan-genomes

convertGenbank2table.py not running to completion with IMG-generated genbank files #60

Closed jcthrash closed 10 years ago

jcthrash commented 10 years ago

Hello, I'm trying to convert my genbank files using convertGenbank2table.py. The genbank files were downloaded from IMG. The script does not finish, and never creates a genbank/ directory.

Command: $ ./convertGenbank2table.py -g SAR11_genbank/HTCC1062.gbk -v 2

Output/Error:
/usr/local/packages/Python/2.7.3/gcc-4.4.6/lib/python2.7/site-packages/Bio/GenBank/Scanner.py:951: BiopythonParserWarning: Invalid indentation for sequence line
  warnings.warn("Invalid indentation for sequence line", BiopythonParserWarning)
Text file saved as /project/jcthrash/tools/ITEP/raw/335992.2.txt
Traceback (most recent call last):
  File "./convertGenbank2table.py", line 391, in
    fid = open(genbank_filename, "w")
IOError: [Errno 2] No such file or directory: '/project/jcthrash/tools/ITEP/genbank/335992.2.gbk'

To try to comply with the input requirements, I've renamed the files to have 6-digit IDs. However, this does not help:

$ ./convertGenbank2table.py -g SAR11_genbank/106234.gbk -v 2 -r

/usr/local/packages/Python/2.7.3/gcc-4.4.6/lib/python2.7/site-packages/Bio/GenBank/Scanner.py:951: BiopythonParserWarning: Invalid indentation for sequence line
  warnings.warn("Invalid indentation for sequence line", BiopythonParserWarning)
WARNING: Backing up original gene output file /project/jcthrash/tools/ITEP/raw/335992.2.txt to location /project/jcthrash/tools/ITEP/335992.2.txt.bk in case something went wrong
Text file saved as /project/jcthrash/tools/ITEP/raw/335992.2.txt
Traceback (most recent call last):
  File "./convertGenbank2table.py", line 391, in
    fid = open(genbank_filename, "w")
IOError: [Errno 2] No such file or directory: '/project/jcthrash/tools/ITEP/genbank/335992.2.gbk'

I will be happy to email the genbank file I'm using if you would like an example to work with.

Thanks in advance!

mattb112885 commented 10 years ago

Greetings!

Sorry, I think my previous response is irrelevant. This is happening because, for some reason, your genbank/ directory is missing (it should be part of the repository). If you create a genbank/ directory, it should work. The idea is that you put your input in some other directory (not genbank/) and the script creates a copy of it in genbank/ with ITEP IDs added and with the expected nomenclature.

I added a check to the script for the directory's existence; it now creates the directory if it is missing.
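For readers who cannot update right away, the kind of check described above might look roughly like the sketch below. This is not the actual change made to convertGenbank2table.py; the helper names `ensure_output_dir` and `open_genbank_output` are hypothetical, and the layout (a genbank/ directory under the ITEP root, files named by organism ID with a .gbk extension) is taken from the paths in the traceback above.

```python
import os


def ensure_output_dir(path):
    """Create `path` (and any missing parents) if it is absent; return it.

    Hypothetical helper, written to run on both Python 2.7 and Python 3.
    """
    if not os.path.isdir(path):
        try:
            os.makedirs(path)
        except OSError:
            # Another process may have created the directory in the
            # meantime; re-raise only if it still does not exist.
            if not os.path.isdir(path):
                raise
    return path


def open_genbank_output(root, organism_id):
    """Open <root>/genbank/<organism_id>.gbk for writing.

    The directory is created first, so open() no longer fails with
    "No such file or directory" when genbank/ is missing.
    """
    genbank_dir = ensure_output_dir(os.path.join(root, "genbank"))
    return open(os.path.join(genbank_dir, organism_id + ".gbk"), "w")
```

Creating the directory by hand (mkdir genbank in the ITEP root) has the same effect for a one-off run.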

Matt

jcthrash commented 10 years ago

Hi Matt, Thanks for your prompt response. We downloaded the software on 3/20/14. Has it been updated since? I’ll make sure to try that again if so.

FYI, attached is one of the genbank files from IMG.

-jct

J. Cameron Thrash Assistant Professor Department of Biological Sciences Louisiana State University 225-578-8210 (office) http://thethrashlab.com Twitter: @DrJCThrash

On Apr 2, 2014, at 15:23, mattb112885 notifications@github.com wrote:

Actually, can you also make sure you have the latest ITEP code? (cd into your ITEP directory and do a 'git pull origin master'). I have fixed a couple of bugs in the convertGenbank2table.py script since the initial release, and this line

File "./convertGenbank2table.py", line 391, in fid = open(genbank_filename, "w")

is no longer line 391.

Best

Matt


mattb112885 commented 10 years ago

I edited my comment above because an old version most likely isn't the problem. Most likely the issue is that you need a genbank/ directory for one of the outputs of convertGenbank2table.py (a modified Genbank file with ITEP IDs attached to each CDS), which for some reason was missing from your installation. I have edited the script to create it if it doesn't exist, so if you update your ITEP it should hopefully work.

Best

Matt

jcthrash commented 10 years ago

Matt, That did solve almost everything, thank you!

Do I have any reason to worry about the following warning?

/usr/local/packages/Python/2.7.3/gcc-4.4.6/lib/python2.7/site-packages/Bio/GenBank/Scanner.py:951: BiopythonParserWarning: Invalid indentation for sequence line
  warnings.warn("Invalid indentation for sequence line", BiopythonParserWarning)

-jct

mattb112885 commented 10 years ago

That just means the Genbank files from IMG don't conform to the Genbank spec (which is a very common problem). You might want to make sure the sequences written to the file in raw/ are the same as the ones in the Genbank file, but if they are, don't worry about it.
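One way to do that comparison is sketched below. It assumes you have already loaded both sets of sequences into dictionaries keyed by the same gene identifier (for example with Biopython's SeqIO.parse on the Genbank side); how to read the raw/ table is left out because its column layout is not described in this thread. The function names are hypothetical and not part of ITEP.

```python
def normalize(seq):
    """Strip all whitespace and upper-case, so line wrapping, indentation
    and letter case do not count as differences."""
    return "".join(seq.split()).upper()


def compare_sequences(genbank_seqs, raw_seqs):
    """Compare two {gene_id: sequence} mappings.

    Returns a dict mapping each problematic gene_id to a short
    description; an empty dict means the two sources agree.
    """
    problems = {}
    for gene_id in sorted(set(genbank_seqs) | set(raw_seqs)):
        if gene_id not in raw_seqs:
            problems[gene_id] = "missing from raw table"
        elif gene_id not in genbank_seqs:
            problems[gene_id] = "missing from GenBank file"
        elif normalize(genbank_seqs[gene_id]) != normalize(raw_seqs[gene_id]):
            problems[gene_id] = "sequence differs"
    return problems
```

If the returned dictionary is empty for a genome, the indentation warning did not change what was parsed for any of the genes you compared.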

Glad it helped.

Matt

mattb112885 commented 10 years ago

jct:

Did you get a lot of BAD messages when loading the genbank files? If you are using the contigs associated with genes, you might want to make sure they're correct; I just fixed a bug in this. See issue #66. Let me know if it runs into problems.

Matt