hyattpd / Prodigal

Prodigal Gene Prediction Software
GNU General Public License v3.0
432 stars 85 forks source link

Fix indexing error for reverse strand in `calc_orf_gc` #101

Open althonos opened 1 year ago

althonos commented 1 year ago

Hi again!

Overview

I noticed while addressing #100 that the GC% computed for genes in the reverse strand was often wrong by a small margin, despite the correct gene sequence being extracted, pointing at an indexing error. After checking for out-of-bound reads, I noticed that in calc_orf_gc the loop would read past sequence end in the following part:

      for(j = last[fr]+3; j <= nod[i].ndx; j+=3)
        gc[fr] += is_gc(seq, j) + is_gc(seq, j+1) + is_gc(seq, j+2);

Indeed, on the reverse strand, last[fr] is set to the index of the STOP codon; because for reverse-strand codon this is always the index of the last nucleotide, not the first, the iteration should start 1 nucleotide later, not 3.

Fix

Start the iteration at the right coordinates :smile:

Example

Taking the same contig CAKWEX010000332.1 as in #100, I ran Prodigal on both the contig and its reverse complement; the genes predicted in both cases matched in sequences, but the GC% didn't match; namely, the GC content was wrong when the genes were on the reverse strand (i changed the gc_cont precision so that the difference is easier to see):

               gene strand    true_gc    fwd_gc    bwd_gc
CAKWEX010000332.1_1     -1  63.592526  0.635925  0.635925
CAKWEX010000332.1_2     -1  58.043118  0.582090  0.580431
CAKWEX010000332.1_3     -1  60.073260  0.601954  0.600733

After applying the fix, the GC-content is consistent independently of whether the gene is on the direct or reverse strand:

               gene strand    true_gc    fwd_gc    bwd_gc
CAKWEX010000332.1_1     -1  63.592526  0.635925  0.635925
CAKWEX010000332.1_2     -1  58.043118  0.580431  0.580431
CAKWEX010000332.1_3     -1  60.073260  0.600733  0.600733
althonos commented 1 year ago

@hyattpd : Do you think you could merge #100 and this PR as well? Thanks :smiley: