A very detailed report by @pchaumeil in https://github.com/althonos/pyrodigal/issues/28 helped me discover a new bug in the Prodigal source. This is affecting the detection of Shine-Dalgarno motifs for start codons on the reverse strand, where the searched region may go out-of-bound.
Overview
In shine_dalgarno_mm and shine_dalgarno_exact, the match buffer stores a score for each position of the motif being searched. In the event the region overlaps with the sequence edge, only the sequence characters should be matched. However, an issue with indexing causes the initialization not to happen for ahead-of-start nucleotides, and the match buffer may contain random bytes. This can lead to the wrong RBS motif being identified, and occasionally some genes having wrong coordinates.
Fix
Fix initialization of the match buffer so that all positions are set to a default mismatch value (-10). In the code that compares the 6-base region to AGGAGG, positions before 0 are skipped, and so match[i] may no be initialized for all values of i:
Initializing the array to -10 makes the undefined behaviour disappear. In fact, since match is stack-allocated, if may contain remnants of a previous call, like 2.0 or 3.0, which made the error hard to debug in the first place.
Example
In the assembly GCA_934838455.1, there is a contig (CAKWEX010000332.1) which ends with the following nucleotide sequence:
...CATGGGGCCTATCC
^^^ ^^^
(1) (2)
It contains a start codon (1) and a RBS motif (2) on the reverse strand. However, after running Prodigal on this contig, the predicted gene is:
As you can see, the RBS motif is AGGA; however, there is no AGGA in the sequence upstream of the start codon, only GGA. The missing A is actually an artifact of the match buffer not being initialized. After applying the fix, the gene is predicted with a GGA motif, as expected:
Hi again !
A very detailed report by @pchaumeil in https://github.com/althonos/pyrodigal/issues/28 helped me discover a new bug in the Prodigal source. This is affecting the detection of Shine-Dalgarno motifs for start codons on the reverse strand, where the searched region may go out-of-bound.
Overview
In
shine_dalgarno_mm
andshine_dalgarno_exact
, thematch
buffer stores a score for each position of the motif being searched. In the event the region overlaps with the sequence edge, only the sequence characters should be matched. However, an issue with indexing causes the initialization not to happen for ahead-of-start nucleotides, and thematch
buffer may contain random bytes. This can lead to the wrong RBS motif being identified, and occasionally some genes having wrong coordinates.Fix
Fix initialization of the
match
buffer so that all positions are set to a default mismatch value (-10). In the code that compares the 6-base region toAGGAGG
, positions before0
are skipped, and somatch[i]
may no be initialized for all values ofi
:Initializing the array to -10 makes the undefined behaviour disappear. In fact, since
match
is stack-allocated, if may contain remnants of a previous call, like2.0
or3.0
, which made the error hard to debug in the first place.Example
In the assembly GCA_934838455.1, there is a contig (
CAKWEX010000332.1
) which ends with the following nucleotide sequence:It contains a start codon (1) and a RBS motif (2) on the reverse strand. However, after running Prodigal on this contig, the predicted gene is:
As you can see, the RBS motif is
AGGA
; however, there is noAGGA
in the sequence upstream of the start codon, onlyGGA
. The missingA
is actually an artifact of thematch
buffer not being initialized. After applying the fix, the gene is predicted with aGGA
motif, as expected: