hyattpd / Prodigal

Prodigal Gene Prediction Software
GNU General Public License v3.0

wrong orf of interest #76

Open claudelelemaystdenis opened 4 years ago

claudelelemaystdenis commented 4 years ago

Hi, I am interested in a particular gene in my genome sequences, and unfortunately, Prodigal doesn't predict it. It predicts a shorter version that does not code for the right protein. When I look at the potential genes Prodigal considers, my gene is present (in bold), but the program chooses another gene (in bold and italic):

10099 10188 - -36.36 0.79 -37.15 TTG None None -4.77 -0.93 -31.45 0.544
10099 10248 - -15.28 -4.36 -10.93 GTG GGAG/GAGG 5-10bp 3.13 -10.05 -3.50 0.567
10099 10260 - -9.23 -4.61 -4.62 ATG None None -2.61 -4.28 2.77 0.562
**10099 10335 - -9.63 -14.81 5.18 ATG GGA/GAG/AGG 5-10bp -1.73 3.41 4.00 0.549**

10185 10289 - -48.46 -17.73 -30.74 TTG None None -4.07 0.66 -26.82 0.571
10185 10301 - -45.41 -13.93 -31.48 TTG None None -3.64 -3.34 -24.00 0.573
10185 10319 - -12.03 -7.63 -4.40 GTG GGA/GAG/AGG 5-10bp -3.01 3.01 -3.90 0.548
10185 10382 - -18.21 10.49 -28.70 TTG None None -2.13 -12.54 -14.03 0.525
***10185 10385 - 3.51 11.90 -8.39 GTG None None -2.10 -3.69 -2.60 0.532***

How can I make sure my gene gets predicted?

Gene of interest:

ATGGACCAAGGCAGAAGTGAAGTCAGTAATCCAGTTGCTGGCCAGTTTGCGTTCCCTTCAAACGCCGCGTTCGGAATGGGAGATCGCGTGCGCAAGAAATCTGGCGCCGCTTGGCAAGGCCAGATTGTCGGGTGGTACTGCACAAAATTGACCCCTGAAGGGTACGCTGTCGAGTCTGAGGCTCACCCTGGCTCGGTACAGATTTATCCTGTTGCGGCACTGGAACGCATCAACTGA

Predicted gene:

gi|xxxxxxxxxx|ref|NZ_xxxxxxxxxxxxxxxx.x|_12 # 10185 # 10385 # -1 # ID=1_12;partial=00;start_type=GTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.532
GTGTTGTCGGGCTACGCAGCAACCCTAGAAATTCAAAAGAAGGGTCATAAATGGACCAAGGCAGAAGTGAAGTCAGTAATCCAGTTGCTGGCCAGTTTGCGTTCCCTTCAAACGCCGCGTTCGGAATGGGAGATCGCGTGCGCAAGAAATCTGGCGCCGCTTGGCAAGGCCAGATTGTCGGGTGGTACTGCACAAAATTGA

hyattpd commented 4 years ago

Unfortunately, machine learning algorithms are never going to be perfect. The only way to guarantee a known gene gets found is through a database search.
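As a rough illustration of the "search for the known gene" route, here is a minimal Python sketch that locates a known gene in a genome by exact string matching on both strands. This is a deliberately simplified stand-in for a real database search (which would normally use an alignment tool and tolerate mismatches); the function names `reverse_complement` and `find_known_gene` and the toy genome are hypothetical and not part of Prodigal.

```python
# Minimal sketch: exact-match lookup of a known gene on both strands.
# Not a substitute for an alignment-based database search.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_known_gene(genome: str, gene: str) -> list[tuple[int, int, str]]:
    """Return (begin, end, strand) for every exact occurrence of `gene`.

    Coordinates are 1-based and inclusive, with begin < end on both
    strands, as in the candidate listing earlier in this thread.
    """
    genome = genome.upper()
    gene = gene.upper()
    hits = []
    for strand, query in (("+", gene), ("-", reverse_complement(gene))):
        pos = genome.find(query)
        while pos != -1:
            hits.append((pos + 1, pos + len(query), strand))
            pos = genome.find(query, pos + 1)
    return hits


if __name__ == "__main__":
    # Toy genome: one gene on the forward strand, one on the reverse strand.
    toy_genome = "CCC" + "ATGAAATGA" + "GGG" + reverse_complement("ATGCCCTAA") + "TT"
    print(find_known_gene(toy_genome, "ATGAAATGA"))
    print(find_known_gene(toy_genome, "ATGCCCTAA"))
```

A hit reported this way gives coordinates that can be compared directly against the genes Prodigal did call in the same region.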

Prodigal collects a variety of signals for each gene candidate. In your case, the wrong gene has a better coding score but a bad start site (GTG with no RBS), while the real gene has a terrible coding score but a much better start site (ATG with a 3-base RBS). So the short answer would be that Prodigal somehow has to get better at recognizing this sequence as coding. The fact that its coding score is low means it uses unusual codons relative to the rest of the organism.
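The trade-off described here can be read directly off the two candidate rows posted in the question. The sketch below assumes that the fourth, fifth, and sixth columns of that listing are the total, coding, and start scores; that reading is an assumption, but it matches the description above and the arithmetic (coding plus start equals total for both rows). The `Candidate` class is hypothetical, and picking the higher total is only a simplification of whatever Prodigal does internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    begin: int
    end: int
    total: float    # assumed: overall score (column 4)
    coding: float   # assumed: coding score (column 5)
    start: float    # assumed: start-site score (column 6)
    codon: str
    rbs_motif: str


# Values copied from the listing in the question.
real = Candidate(10099, 10335, -9.63, -14.81, 5.18, "ATG", "GGA/GAG/AGG")
called = Candidate(10185, 10385, 3.51, 11.90, -8.39, "GTG", "None")

for label, cand in (("gene of interest", real), ("called gene", called)):
    recomputed = cand.coding + cand.start
    print(
        f"{label}: {cand.codon} start, RBS {cand.rbs_motif}, "
        f"coding {cand.coding:+.2f} + start {cand.start:+.2f} "
        f"= {recomputed:+.2f} (listed total {cand.total:+.2f})"
    )

# Simplification: the candidate with the higher total is preferred.
preferred = max((real, called), key=lambda c: c.total)
print("preferred:", preferred.begin, preferred.end)
```

Seen this way, the good start site of the gene of interest is not enough to offset its strongly negative coding score.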

One thing I've thought about is an option to search candidates against a database when a region has candidates in more than one reading frame (missing the start site is less of a problem than calling a gene in the wrong frame), but only if they are the best gene in their region along at least one axis (i.e. best coding score or best start score).
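To make the idea concrete, here is a hypothetical Python sketch of that selection rule applied to the nine candidates from the question. It is not an existing Prodigal feature. The assumptions: the fifth and sixth columns of the listing are the coding and start scores, all candidates are on the minus strand so the lower coordinate is the shared stop codon, and two candidates are in the same frame when those coordinates agree modulo 3. The names `Candidate` and `flag_for_database_search` are made up for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    begin: int      # lower coordinate (the stop codon on the minus strand)
    end: int        # higher coordinate (the start codon on the minus strand)
    coding: float   # assumed: coding score (column 5)
    start: float    # assumed: start-site score (column 6)


# Rows copied from the listing in the question.
REGION = [
    Candidate(10099, 10188, 0.79, -37.15),
    Candidate(10099, 10248, -4.36, -10.93),
    Candidate(10099, 10260, -4.61, -4.62),
    Candidate(10099, 10335, -14.81, 5.18),
    Candidate(10185, 10289, -17.73, -30.74),
    Candidate(10185, 10301, -13.93, -31.48),
    Candidate(10185, 10319, -7.63, -4.40),
    Candidate(10185, 10382, 10.49, -28.70),
    Candidate(10185, 10385, 11.90, -8.39),
]


def flag_for_database_search(region: list[Candidate]) -> list[Candidate]:
    """Return the candidates worth checking against a database.

    A candidate qualifies if it is the best in the region along at least
    one axis (coding score or start score). Nothing is flagged unless the
    qualifying candidates fall in more than one reading frame.
    """
    best_coding = max(region, key=lambda c: c.coding)
    best_start = max(region, key=lambda c: c.start)
    picks = [best_coding]
    if best_start != best_coding:
        picks.append(best_start)
    frames = {c.begin % 3 for c in picks}
    return picks if len(frames) > 1 else []


flagged = flag_for_database_search(REGION)
for cand in flagged:
    print(cand.begin, cand.end, "frame", cand.begin % 3)
```

On this data the rule flags both the called gene (best coding score) and the gene of interest (best start score), which sit in different frames, so a database lookup could arbitrate between them.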

claudelelemaystdenis commented 4 years ago

Thanks for your quick answer! Your remark on unusual codons is actually really insightful :) This database option is not part of the current Prodigal, right? In short, should I give up on Prodigal as a tool for predicting this gene?