To restrict computation to a banded diagonal of the dynamic programming matrix the dprog function computes the offset of the furthest node to consider when scoring a node. In the case of giant ORFs, this limit must sometimes be increased to make sure the complete ORF is always accounted for. In the current code, this is done in a faulty way and causes connections to be sometimes recomputed for the entire contig, because min is always reset to zero in such cases.
I tested the patch with 50 genomes and the results were the same to the main branch (according to the scores file output); however on a large genome (Sorangium cellulosum) the patch saves 9s (40%) of runtime:
Benchmark 1: ./prodigal -i GCF_000067165.1_ASM6716v1_genomic.fna -o /dev/null
Time (mean ± σ): 15.744 s ± 0.924 s [User: 15.639 s, System: 0.069 s]
Range (min … max): 14.556 s … 16.704 s 4 runs
Benchmark 2: ./prodigal_orig -i GCF_000067165.1_ASM6716v1_genomic.fna -o /dev/null
Time (mean ± σ): 22.440 s ± 1.216 s [User: 22.296 s, System: 0.076 s]
Range (min … max): 21.249 s … 24.135 s 4 runs
Summary
./prodigal -i GCF_000067165.1_ASM6716v1_genomic.fna -o /dev/null ran
1.43 ± 0.11 times faster than ./prodigal_orig -i GCF_000067165.1_ASM6716v1_genomic.fna -o /dev/null
To restrict computation to a banded diagonal of the dynamic programming matrix the
dprog
function computes the offset of the furthest node to consider when scoring a node. In the case of giant ORFs, this limit must sometimes be increased to make sure the complete ORF is always accounted for. In the current code, this is done in a faulty way and causes connections to be sometimes recomputed for the entire contig, becausemin
is always reset to zero in such cases.I tested the patch with 50 genomes and the results were the same to the main branch (according to the scores file output); however on a large genome (Sorangium cellulosum) the patch saves 9s (40%) of runtime: