Hi again!
Overview
I noticed while addressing #100 that the GC% computed for genes on the reverse strand was often wrong by a small margin, despite the correct gene sequence being extracted, which pointed to an indexing error. After checking for out-of-bounds reads, I noticed that in calc_orf_gc the loop would read past the end of the sequence in the following part:
Indeed, on the reverse strand, last[fr] is set to the index of the STOP codon; because for a reverse-strand codon this is always the index of the last nucleotide, not the first, the iteration should start 1 nucleotide later, not 3.
Fix
Start the iteration at the right coordinates :smile:
Example
Taking the same contig CAKWEX010000332.1 as in #100, I ran Prodigal on both the contig and its reverse complement; the genes predicted in both cases had matching sequences, but the GC% didn't match: the GC content was wrong when the genes were on the reverse strand (I changed the gc_cont precision so that the difference is easier to see).
After applying the fix, the GC content is consistent regardless of whether the gene is on the direct or reverse strand.