hyattpd / Prodigal

Prodigal Gene Prediction Software
GNU General Public License v3.0
432 stars 85 forks source link

Fix buffer initialization in SD motif detection code #100

Closed althonos closed 1 year ago

althonos commented 1 year ago

Hi again !

A very detailed report by @pchaumeil in https://github.com/althonos/pyrodigal/issues/28 helped me discover a new bug in the Prodigal source. This is affecting the detection of Shine-Dalgarno motifs for start codons on the reverse strand, where the searched region may go out-of-bound.

Overview

In shine_dalgarno_mm and shine_dalgarno_exact, the match buffer stores a score for each position of the motif being searched. In the event the region overlaps with the sequence edge, only the sequence characters should be matched. However, an issue with indexing causes the initialization not to happen for ahead-of-start nucleotides, and the match buffer may contain random bytes. This can lead to the wrong RBS motif being identified, and occasionally some genes having wrong coordinates.

Fix

Fix initialization of the match buffer so that all positions are set to a default mismatch value (-10). In the code that compares the 6-base region to AGGAGG, positions before 0 are skipped, and so match[i] may no be initialized for all values of i:

  for(i = 0; i < limit; i++) {
    if(pos+i < 0) continue;
    if(i%3 == 0 && is_a(seq, pos+i) == 1) match[i] = 2.0;
    else if(i%3 != 0 && is_g(seq, pos+i) == 1) match[i] = 3.0;
    else match[i] = -10.0;
  }

Initializing the array to -10 makes the undefined behaviour disappear. In fact, since match is stack-allocated, if may contain remnants of a previous call, like 2.0 or 3.0, which made the error hard to debug in the first place.

Example

In the assembly GCA_934838455.1, there is a contig (CAKWEX010000332.1) which ends with the following nucleotide sequence:

...CATGGGGCCTATCC
   ^^^        ^^^
   (1)        (2)

It contains a start codon (1) and a RBS motif (2) on the reverse strand. However, after running Prodigal on this contig, the predicted gene is:

>CAKWEX010000332.1_3 # 2936 # 3754 # -1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.602
MSNHFEGLGKTWLTLLNDPEKEVPAVVMQVMKEGKTRDCWQRKDSKEETMVLAWPVETGF
RAGVTVHGNAGDQLRPVSTYPLLEGAPNDMTVNETYLWQNETEGEVSATCNEGANPLWFY
SPFLFRDRENLTPGVRHTFLIAGLAYGLRRALLDEMTITEGVEYERYVAEWLAQNPGKTR
LDVPQLTVDLRGARIVVPGDVASEYQIRVPVTSVEEMHIQNEKIYMLIVEFGLNTPNPLR
FPLYAPERVCKIVPQAGDEIDAIIWLQGRIID*

As you can see, the RBS motif is AGGA; however, there is no AGGA in the sequence upstream of the start codon, only GGA. The missing A is actually an artifact of the match buffer not being initialized. After applying the fix, the gene is predicted with a GGA motif, as expected:

>CAKWEX010000332.1_3 # 2936 # 3754 # -1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.602
MSNHFEGLGKTWLTLLNDPEKEVPAVVMQVMKEGKTRDCWQRKDSKEETMVLAWPVETGF
RAGVTVHGNAGDQLRPVSTYPLLEGAPNDMTVNETYLWQNETEGEVSATCNEGANPLWFY
SPFLFRDRENLTPGVRHTFLIAGLAYGLRRALLDEMTITEGVEYERYVAEWLAQNPGKTR
LDVPQLTVDLRGARIVVPGDVASEYQIRVPVTSVEEMHIQNEKIYMLIVEFGLNTPNPLR
FPLYAPERVCKIVPQAGDEIDAIIWLQGRIID*
althonos commented 1 year ago

@hyattpd: Do you think you could have a look at #101 and #102 too? 🙏