hyattpd / Prodigal

Prodigal Gene Prediction Software
GNU General Public License v3.0
432 stars 85 forks source link

Avoid skipping candidate RBS positions in `rbs_score` #102

Open althonos opened 1 year ago

althonos commented 1 year ago

Hi, one final PR :smiley:

In the test sequence I used for #100 I noticed the following bug: after reverse-complementing a sequence, the RBS spacer for one of the predicted gene was changing when the contig was reverse-complemented:

Forward:
CAKWEX010000332.1   Prodigal_v2.6.3 CDS 520 2178    298.9   -   0   ID=1_1;partial=00;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.636;conf=99.99;score=298.87;cscore=285.76;sscore=13.11;rscore=7.85;uscore=1.61;tscore=3.65;
CAKWEX010000332.1   Prodigal_v2.6.3 CDS 2250    2852    105.5   -   0   ID=1_2;partial=00;start_type=GTG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.580;conf=100.00;score=105.47;cscore=95.43;sscore=10.04;rscore=13.75;uscore=1.16;tscore=-4.87;
CAKWEX010000332.1   Prodigal_v2.6.3 CDS 2936    3754    140.5   -   0   ID=1_3;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.601;conf=100.00;score=140.50;cscore=133.74;sscore=6.75;rscore=2.03;uscore=1.08;tscore=3.65;
Reverse-complemented:
CAKWEX010000332.1_r Prodigal_v2.6.3 CDS 12  830 136.1   +   0   ID=1_1;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=3-4bp;gc_cont=0.601;conf=100.00;score=136.12;cscore=133.74;sscore=2.37;rscore=-2.35;uscore=1.08;tscore=3.65;
CAKWEX010000332.1_r Prodigal_v2.6.3 CDS 914 1516    105.5   +   0   ID=1_2;partial=00;start_type=GTG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.580;conf=100.00;score=105.47;cscore=95.43;sscore=10.04;rscore=13.75;uscore=1.16;tscore=-4.87;
CAKWEX010000332.1_r Prodigal_v2.6.3 CDS 1588    3246    298.9   +   0   ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.636;conf=99.99;score=298.87;cscore=285.76;sscore=13.11;rscore=7.85;uscore=1.61;tscore=3.65;

Indeed, the gene with the GGA/GAG/AGG RBS motif has a spacer detected as 3-4bp when on the forward strand, and 5-10bp on the reverse strand. The contig in question starts with the following sequence:

GGATAGGCCCCATG...

so it has both a match in the 3-4bp range (AGG) and in the 5-10bp range (GGA), but since the 5-10bp spacer has a higher score it should be the one to be selected. This actually matters on the gene score, so it could cause some predictions to change.

The problem was coming from the loops in rbs_score which skip some positions before index 0; however, when there may be a partial match (as it is the case here, with a GGA motif right on the contig edge), the positions should not be skipped, and the decision to ignore some positions should be made by the shine_dalgarno_exact and shine_dalgarno_mm functions directly.

After applying the patch, the predictions are consistent independent of the directionality of the contig, the RBS spacers and hence the gene scores match:

Forward:
CAKWEX010000332.1   Prodigal_v2.6.3 CDS 520 2178    298.9   -   0   ID=1_1;partial=00;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.636;conf=99.99;score=298.87;cscore=285.76;sscore=13.11;rscore=7.85;uscore=1.61;tscore=3.65;
CAKWEX010000332.1   Prodigal_v2.6.3 CDS 2250    2852    105.5   -   0   ID=1_2;partial=00;start_type=GTG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.582;conf=100.00;score=105.47;cscore=95.43;sscore=10.04;rscore=13.75;uscore=1.16;tscore=-4.87;
CAKWEX010000332.1   Prodigal_v2.6.3 CDS 2936    3754    140.5   -   0   ID=1_3;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.602;conf=100.00;score=140.50;cscore=133.74;sscore=6.75;rscore=2.03;uscore=1.08;tscore=3.65;
Reverse-complemented:
CAKWEX010000332.1_r Prodigal_v2.6.3 CDS 12  830 140.5   +   0   ID=1_1;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.601;conf=100.00;score=140.50;cscore=133.74;sscore=6.75;rscore=2.03;uscore=1.08;tscore=3.65;
CAKWEX010000332.1_r Prodigal_v2.6.3 CDS 914 1516    105.5   +   0   ID=1_2;partial=00;start_type=GTG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.580;conf=100.00;score=105.47;cscore=95.43;sscore=10.04;rscore=13.75;uscore=1.16;tscore=-4.87;
CAKWEX010000332.1_r Prodigal_v2.6.3 CDS 1588    3246    298.9   +   0   ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.636;conf=99.99;score=298.87;cscore=285.76;sscore=13.11;rscore=7.85;uscore=1.61;tscore=3.65;