inab / trimal

A tool for automated alignment trimming in large-scale phylogenetic analyses. Development version: 2.0
https://trimal.readthedocs.io/
GNU General Public License v3.0
175 stars 41 forks source link

Improve performance of `Similarity::calculateVectors` #66

Closed althonos closed 2 years ago

althonos commented 2 years ago

Hi there!

While working on althonos/pytrimal I did some thorough profiling of the code, and I identified some critical sections that could be improved. In particular, I noticed that the code in Similarity::calculateVectors was sub-optimal, because it was repeatedly calling similarityMatrix::getDistance with the same sequence characters, and the check for invalid/incorrect symbols seems to have a high performance impact.

To fix this, I added two buffers to store column data; the first one for the sequence itself, storing uppercase column characters to reduce the number of utils::toUpper calls; the other one to store the indices of gapped/indeterminate characters. The sequence characters for a column are checked once when the column is copied; after that, the distance matrix is indexed directly, without checking character ranges.

I used valgrind to count cycles on a run of trimAl in strict mode on example.073.AA.strNOG.ENOG411BFCW.fasta, here are the results in number of cycles:

Object cycles (before PR) cycles (after PR)
Similarity::calculateVectors (self) 3,132,810,840 1,652,854574
Similarity::calculateVectors (incl) 8,281,641,990 3,240,879,580
trimal (total) 13,504,155,928 8,463,690,891
althonos commented 2 years ago

Superseded by #69 .