Genbank file issues with spliced proteins

nick-youngblut commented 10 years ago

Hi Matt, a girl in my lab is trying to set up ITEP with some bacterial genomes. She ran into an issue: convertGenbank2table.py fails because some of her genbank files (downloaded from RAST) have join() in the CDS location info. Example:

     CDS             join(544..589,688..>1032)
                     /product="T-cell receptor beta-chain"

She's just going to delete those CDS features, but this is definitely not optimal.

Thanks. Nick
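For readers unfamiliar with the notation, here is a minimal sketch of what a join() location encodes and one way to collapse it to a single outer span. It uses only the Python standard library and is not ITEP's code (ITEP relies on Biopython for GenBank parsing). The names parse_location and outer_span are hypothetical, and the sketch assumes simple locations: it does not handle single-base positions or a complement() nested inside a join().

```python
import re

# One segment such as "544..589" or "688..>1032". The optional "<" / ">"
# mark a partial (open-ended) start or end in GenBank location notation.
SEGMENT = re.compile(r"(<?)(\d+)\.\.(>?)(\d+)")


def parse_location(location):
    """Return (strand, segments) for a simple or join() location string.

    Each segment is (start, end, is_partial) with 1-based inclusive
    coordinates, exactly as written in the GenBank file.
    Assumption: complement(), if present, wraps the whole location.
    """
    strand = -1 if location.startswith("complement(") else 1
    segments = [
        (int(start), int(end), bool(lt or gt))
        for lt, start, gt, end in SEGMENT.findall(location)
    ]
    if not segments:
        raise ValueError("unrecognized location: %r" % location)
    return strand, segments


def outer_span(location):
    """Collapse all segments to one (start, end, strand), ignoring the splice."""
    strand, segments = parse_location(location)
    start = min(seg[0] for seg in segments)
    end = max(seg[1] for seg in segments)
    return start, end, strand


if __name__ == "__main__":
    print(parse_location("join(544..589,688..>1032)"))
    print(outer_span("join(5090..8716,8720..8794)"))
```

Collapsing to the outer span loses the intron boundaries, so a table built this way records where the gene sits but not which bases are actually translated.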

mattb112885 commented 10 years ago

Nick:

I haven't seen this before in bacterial genomes; the flat formats used in ITEP are not really designed to handle splicing.

That being said, can you tell me the error you're getting?

Matt

nick-youngblut commented 10 years ago

Hi Matt, Here's the error I'm getting:

Traceback (most recent call last):
  File "./convertGenbank2table.py", line 406, in <module>
    raise KeyError
KeyError

Using genbank files from RAST, a subset of CDS looks like this:

CDS join(5090..8716,8720..8794)

Thanks much!

Best, Mallory

Begin forwarded message:

From: Nicholas David Youngblut nyoungblut@cornell.edu
Subject: FW: [clusterDbAnalysis] join() in genbank (#59)
Date: February 27, 2014 2:55:43 PM EST
To: "mjchoudoir@gmail.com" mjchoudoir@gmail.com

Hi Mallory, Matt, the creator of ITEP, got back to me on the ‘join()’ bug for the convertGenbank2table.py script. Can you send him the error that you got?

Thanks. Nick


mattb112885 commented 10 years ago

Line 406 is a blank line now (there have been some changes to that script since the initial release). Could you update your copy of ITEP with this command

$ git pull origin master

and tell me if there is still a problem / what error you get when trying to run it?

Thanks and best

Matt

mattb112885 commented 10 years ago

I looked into this a little (with an Arabidopsis chromosome). With Biopython 1.61 and the latest ITEP code, it succeeded in making a table with all the genes (it treats the location as if there were no splice site). However, I do need to fix a problem with multiply-spliced proteins in Genbank files: the ITEP IDs won't be added for multiply-spliced proteins because I didn't build the lookup table correctly, assuming that there would only be one protein in the same region of DNA. I'll fix that problem, but it is unlikely to affect you with a bacterial genome.

Matt
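To illustrate the kind of lookup-table problem described above, here is a small sketch in plain Python. It is not ITEP's actual code: the record layout, the IDs, and the function names build_span_lookup and build_location_lookup are all hypothetical. It only shows how a table keyed on the outer region can hold a single protein per region, and how keying on the full location string keeps alternative splice forms apart.

```python
from collections import defaultdict

# Hypothetical records: (protein_id, contig, start, end, strand, location).
# Both proteins cover the same outer region but are spliced differently.
RECORDS = [
    ("prot_A", "chr1", 544, 1032, 1, "join(544..589,688..1032)"),
    ("prot_B", "chr1", 544, 1032, 1, "join(544..589,700..1032)"),
]


def build_span_lookup(records):
    """Key on the outer region only.

    This bakes in the assumption of one protein per region: a second
    protein with the same key silently overwrites the first.
    """
    return {
        (contig, start, end, strand): protein_id
        for protein_id, contig, start, end, strand, _location in records
    }


def build_location_lookup(records):
    """Key on the full location string and keep every protein per key."""
    lookup = defaultdict(list)
    for protein_id, contig, _start, _end, strand, location in records:
        lookup[(contig, strand, location)].append(protein_id)
    return lookup


if __name__ == "__main__":
    print(len(build_span_lookup(RECORDS)))      # entries keyed by region
    print(len(build_location_lookup(RECORDS)))  # entries keyed by location
```

In bacterial genomes without spliced genes, the two keying schemes would normally give the same table, which fits the remark that the problem is unlikely to matter there.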

nick-youngblut commented 10 years ago

Thanks for looking into it.

Nick


mattb112885 / clusterDbAnalysis

Genbank file issues with spliced proteins #59