mattb112885 / clusterDbAnalysis

ITEP - Integrated Toolkit for Exploration of microbial Pan-genomes
26 stars 15 forks source link

Error in convertGenbank2table.py if /pseudo in genbank #81

Open michoug opened 7 years ago

michoug commented 7 years ago

Dear All,

I was parsing a genbank file (more precisely a gbff file (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/365/GCF_000009365.1_ASM936v1/GCF_000009365.1_ASM936v1_genomic.gbff.gz)) and due to the fact that some genes are annotated as /pseudo there are several errors such as :

WARNING: CDS found with no translation Qualifiers: locus_tag ['ABO_RS05435'] inference ['EXISTENCE: similar to AA sequence:RefSeq:WP_011588335.1'] old_locus_tag ['ABO_1051'] codon_start ['1'] pseudo [''] exception ['unextendable partial coding region'] transl_table ['11'] note ['Derived by automated computational analysis using gene prediction method: Protein Homology.']

with coordinates : complement(<1191967..1192467)

and then

BAD - the gene in the following location did not have a match in the table file! <1191966:1192467 ATGCCAAAACGGATATTCCGCAAATACCTGCCAACACCCGAACGTATTCGACAAACCAAGTCGTTGAGTTTTCTCGGGGAAGTGCTATCGGACCCAAACCTTTGGCACATTAACCGTCGTTCACTGGCCGGCGCAGCCTTTATCGGTATCTTCTCCGGTCTGTTACCTATCCCCCTGCAAATGGGGTTGGCCGCCCTGTTGGCGGTGCGCTTCCACTGCAACCTGCCGCTTTCGATTATGCTGGTATGGATTTCCAATCCAGTGACTTATGTGCCGATTTTCTATTTCACCTACCGCATTGGCGCCTGGCTGCTGGGAATGCCGCCCCACAGCGGTGAAGGCATCACCGTCGCCTGGTTTGTGGAACAGCTCATCCCATTATGGGTAGGTTCAATGCTATGCGCATTCGGGTTTGGCGGATTAGCGTACATGGCAGTAAAAGTCAGCTGGCGGCTAGCTGTAATTCGAAGTTGGAATCTACGCGCGCACCGGCGCGCACGG

Best Greg

JosephRyanPeterson commented 7 years ago

Hi Gregg, The error arises from line 192 of convertGEnbank2table.py. Essentially, if there is no translation of the gene in the Genbank file, ITEP does not create a representation of the gene. As a consequence it is not added to the gene product list and thus it is not associated with the genome. When ITEP later attempts to extract the gene sequence and associate it with a gene product, it cannot find a translation associated with that location in the genome.

I think we could do something to handle this issue, but I don't have any ideas as to what the correct behaviour should be. Being that the broader ITEP infrastructure requires translations to be associated with genes, we would somehow have to predict a translation for that pseudogene. I'm open to ideas/suggestions.

mattb112885 commented 7 years ago

Random thoughts: We could look for empty protein sequences and call Translate() on the seq object that comes out of the Genbank parsing if it is missing. However, this won't always be a good idea, in some cases the pseudogene would be useful in analysis (e.g. in cases where you're interested in the gene's evolution. If the pseudogene was not called you would use tblastn to find it but the existing annotation removes the need to do that) but in other cases it may not be (if we're only interested in what are likely to be real genes, such as in metabolic modeling, the presence of these genes means the user has to go through and filter them out later. But if the pseudogenes are given translations the user would have to do that anyway).

My guess is we would want to add something to the existing annotation to indicate that it was marked as a pseudogene? That way at least if they show up in clusters you could see that they are there and exclude them in post-processing. Thoughts?

mattb112885 commented 7 years ago

One other comment: I do not know how well ITEP and the tools it depends on (raxml, blast, etc) will behave if you have a stop codon (*) character in the middle of some of your sequences ...

JosephRyanPeterson commented 7 years ago

@mattb112885 Another minor issue with translate is that it may not be appropriate for sequences such as tRNAs or rRNAs. While most Genbank files are pretty good about marking tRNAs or rRNAs, I have run into poorly formatted files where this isn't the case. I would worry that we might end up with proteins related to non-coding RNAs.

It should be pretty straightforward to add an additional column to the database indicating that the translation was predicted rather than read. Should this be an additional column that is displayed when using "db_getGeneinformation.py"? Also, what are your thoughts for the default behaviour when creating metabolic models? Clearly, if the gene was missed by an annotation pipeline, but is a homolog of a metabolic protein (e.g. pyk) wouldn't you want it to be added to the model?

Finally, if I were to add this functionality, do you have any idea what other scripts might break in the meantime?

mattb112885 commented 7 years ago

I believe as long as you add the extra column to the end of the table, the database change itself wouldn't break any existing scripts... however, if you do this, any scripts you modify to require the extra column would no longer work with old versions of the database, and they would require a rebuild (as you know, this takes a lot of time), unless you were careful to check for this case in the scripts. An alternative would be to add something to the existing annotation field (like [MACHINE-TRANSLATED]) to indicate it and just look for that in your scripts.

I would exclude these from metabolic models - usually when genes are marked as pseudogenes explicitly, the annotation software thinks there is a change like a frameshift or nonsense mutation that made it no longer functional.