While working on althonos/pytrimal I did some thorough profiling of the code, and I identified some critical sections that could be improved. In particular, I noticed that the code in Similarity::calculateVectors was sub-optimal, because it was repeatedly calling similarityMatrix::getDistance with the same sequence characters, and the check for invalid/incorrect symbols seems to have a high performance impact.
To fix this, I added two buffers to store column data; the first one for the sequence itself, storing uppercase column characters to reduce the number of utils::toUpper calls; the other one to store the indices of gapped/indeterminate characters. The sequence characters for a column are checked once when the column is copied; after that, the distance matrix is indexed directly, without checking character ranges.
I used valgrind to count cycles on a run of trimAl in strict mode on example.073.AA.strNOG.ENOG411BFCW.fasta, here are the results in number of cycles:
Hi there!
While working on althonos/pytrimal I did some thorough profiling of the code, and I identified some critical sections that could be improved. In particular, I noticed that the code in
Similarity::calculateVectors
was sub-optimal, because it was repeatedly callingsimilarityMatrix::getDistance
with the same sequence characters, and the check for invalid/incorrect symbols seems to have a high performance impact.To fix this, I added two buffers to store column data; the first one for the sequence itself, storing uppercase column characters to reduce the number of
utils::toUpper
calls; the other one to store the indices of gapped/indeterminate characters. The sequence characters for a column are checked once when the column is copied; after that, the distance matrix is indexed directly, without checking character ranges.I used
valgrind
to count cycles on a run of trimAl in strict mode onexample.073.AA.strNOG.ENOG411BFCW.fasta
, here are the results in number of cycles:Similarity::calculateVectors
(self)Similarity::calculateVectors
(incl)trimal
(total)