quim0 / WFA-GPU

GPU implementation of the Wavefront Alignment Algorithm for global, gap-affine, pairwise sequence alignment
MIT License

Questions regarding table 1 / CPU comparison #3

Closed RagnarGrootKoerkamp closed 1 year ago

RagnarGrootKoerkamp commented 2 years ago

In table 1, you compare several CPU implementations, but it is not quite clear to me exactly which tools you use.

Edlib

Edlib implements an $O(nd)$ algorithm based on the exponential search in Ukkonen'85. This part is clear :smile:
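
(For reference, by this I mean something like the band-doubling sketch below. This is just my own illustration of the Ukkonen'85 idea, not Edlib's actual code: only cells with $|i-j| \le k$ are filled, and $k$ is doubled until the result fits inside the band, which gives the $O(nd)$ behaviour.)

```cpp
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Banded edit distance: returns the exact distance if it is <= k, otherwise
// some value > k. Only cells with |i - j| <= k are filled, so one pass is O(n*k).
int bandedEditDistance(const std::string& a, const std::string& b, int k) {
    const int n = a.size(), m = b.size();
    if (std::abs(n - m) > k) return k + 1;                    // end cell lies outside the band
    const int INF = k + 1;
    std::vector<int> prev(m + 1, INF), curr(m + 1, INF);
    for (int j = 0; j <= std::min(m, k); ++j) prev[j] = j;    // row 0 inside the band
    for (int i = 1; i <= n; ++i) {
        std::fill(curr.begin(), curr.end(), INF);
        if (i <= k) curr[0] = i;                              // column 0, if inside the band
        for (int j = std::max(1, i - k); j <= std::min(m, i + k); ++j) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            int del = (j < i + k) ? prev[j] + 1 : INF;        // cell above, if in band
            int ins = (j > i - k) ? curr[j - 1] + 1 : INF;    // cell to the left, if in band
            curr[j] = std::min({sub, del, ins});
        }
        std::swap(prev, curr);
    }
    return prev[m];
}

// Ukkonen'85-style exponential search: double the band until the result fits in it.
int editDistance(const std::string& a, const std::string& b) {
    for (int k = 1;; k *= 2) {
        int d = bandedEditDistance(a, b, k);
        if (d <= k) return d;                                 // exact once it fits in the band
    }
}

int main() { std::cout << editDistance("GATTACA", "GACTATA") << "\n"; }  // prints 2
```

Since the last band tried is at most twice the true distance, the total work is $O(nd)$.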

$O(ND)$

In the background section you mention DAligner, but my limited understanding is that this is a mapper, not a global aligner. In the experimental setup you only cite [37], the original Myers'86 paper, which does not come with an implementation itself.

I'm assuming this is some implementation of Myers'86 for edit distance (as opposed to LCS, which is what Myers'86 is originally about). That means it has $O(nd)$ worst case complexity (as the name indicates) and $O(d^2)$ expected complexity (as mentioned in section 4.1 of Myers'86).
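
(To make my assumption concrete, here is a rough furthest-reaching-point sketch of that edit-distance variant, in the Ukkonen/Landau-Vishkin style. It is my own illustration, not the code evaluated in the paper: for each score $d$ and diagonal $k = i - j$, only the furthest row reachable with at most $d$ edits is kept, and matches along a diagonal are followed for free.)

```cpp
#include <algorithm>
#include <climits>
#include <iostream>
#include <string>
#include <vector>

// Furthest-reaching-point edit distance: fr[k] is the furthest row i reachable on
// diagonal k = i - j with at most d edits. O(nd) worst case, and roughly O(d^2)
// cells are touched, which is why a dependency on the error rate is expected.
int frEditDistance(const std::string& a, const std::string& b) {
    const int n = a.size(), m = b.size();
    const int off = m + 1;                           // diagonals range over [-m, n]
    const int NEG = INT_MIN / 2;                     // "diagonal not reached yet"
    std::vector<int> prev(n + m + 3, NEG), curr(n + m + 3, NEG);

    auto extend = [&](int i, int k) {                // follow free matches down diagonal k
        int j = i - k;
        while (i < n && j < m && a[i] == b[j]) { ++i; ++j; }
        return i;
    };

    prev[off] = extend(0, 0);                        // d = 0 wavefront
    if (n == m && prev[off] == n) return 0;

    for (int d = 1;; ++d) {
        for (int k = std::max(-d, -m); k <= std::min(d, n); ++k) {
            int del = prev[k - 1 + off] + 1;         // gap in b: from diagonal k-1, row +1
            int ins = prev[k + 1 + off];             // gap in a: from diagonal k+1, row same
            int sub = prev[k + off] + 1;             // mismatch: same diagonal, row +1
            int i = std::min({std::max({del, ins, sub}), n, m + k});  // clamp to the matrix
            if (i < 0) { curr[k + off] = NEG; continue; }
            curr[k + off] = extend(i, k);
        }
        if (curr[n - m + off] == n) return d;        // reached cell (n, m) with d edits
        std::swap(prev, curr);
    }
}

int main() { std::cout << frEditDistance("GATTACA", "GACTATA") << "\n"; }  // prints 2
```

Each wavefront touches $O(d)$ diagonals, which is where the $O(d^2)$ dependence on the number of errors comes from; the free match extensions are what keep the expected case cheap on similar sequences.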

BPM

From the background section, I think this is an abbreviation of Myers' bit-vector algorithm. You list two such tools, Edlib and BGSA. Does this mean that BPM in the table refers to BGSA?
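
(To spell out what I mean by that: the single-word, global-alignment form of Myers' bit-vector algorithm looks roughly like the sketch below. Again, this is my own illustration for queries of length $\le 64$, not the code of any of the tools in the table. One DP column is updated with a constant number of 64-bit word operations, so the total work is about $nm/64$ word operations no matter how similar the sequences are, which is consistent with the "insensitive to the alignment error" observation.)

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <iostream>
#include <string>

// Single-word Myers bit-vector ("BPM") edit distance, global-alignment variant.
// The DP column is encoded as +1/-1 vertical deltas in two 64-bit words, so a
// whole column is updated in O(1) word operations: ~ n*m/64 work in total,
// independent of how similar the sequences are.
int bpmEditDistance(const std::string& query, const std::string& text) {
    const int m = (int)query.size();
    assert(m >= 1 && m <= 64);                       // single-word sketch only

    std::array<uint64_t, 256> Peq{};                 // Peq[c]: positions where query[i] == c
    for (int i = 0; i < m; ++i)
        Peq[(unsigned char)query[i]] |= 1ULL << i;

    uint64_t Pv = ~0ULL, Mv = 0;                     // vertical +1 / -1 deltas of column 0
    int score = m;                                   // D(m, 0) = m
    const uint64_t high = 1ULL << (m - 1);

    for (unsigned char c : text) {
        uint64_t Eq = Peq[c];
        uint64_t Xv = Eq | Mv;
        uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
        uint64_t Ph = Mv | ~(Xh | Pv);               // horizontal +1 deltas
        uint64_t Mh = Pv & Xh;                       // horizontal -1 deltas
        if (Ph & high) ++score;                      // track D(m, j) at the bottom row
        if (Mh & high) --score;
        Ph = (Ph << 1) | 1;                          // global alignment: top row grows by +1
        Mh <<= 1;
        Pv = Mh | ~(Xv | Ph);
        Mv = Ph & Xv;
    }
    return score;                                    // D(m, n)
}

int main() { std::cout << bpmEditDistance("GATTACA", "GACTATA") << "\n"; }  // prints 2
```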

BGSA seems to be a score-only tool that implements the classic $O(n^2)$ NW algorithm in parallel via SIMD over multiple alignment tasks. That way, it achieves a 30x speedup over Edlib in their paper. I don't see a similar speedup in your table, so maybe my assumption that you are using BGSA here is wrong?

Table 1 evaluation

In section IV.C, you write:

> Unsurprisingly, DP-based implementations (i.e., BPM, Edlib, NVBio, and GASAL2) are insensitive to the alignment error, performing the same amount of computations to align similar sequences as to align very divergent ones.

Edlib has an $O(nd)$ runtime, and in fact, I am quite surprised not to see a dependency of the runtime on the error rate. The dependency may be hidden by the (IO) overhead of dealing with very short sequences. What are your thoughts about this?

Similarly, the $O(ND)$ implementation should have an $O(d^2)$ expected runtime, but again, barely any dependency on $e$ is visible in the runtime. Again, this may imply that the measured times are mostly IO overhead. In fact, for lengths 150 and 300, higher error rates give faster runtimes?!
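
(As a back-of-the-envelope check, using my own rough numbers rather than anything from the paper: going from $e = 2\%$ to $e = 5\%$ increases $d$ by $2.5\times$, so an $O(d^2)$ method should get roughly $6\times$ slower if the alignment itself dominated the measured time.)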

Assuming that BPM is an $O(n^2)$ algorithm, the lack of dependency on $e$ there indeed makes sense.

You continue:

> As a result, the performance of classical DP-based algorithms is heavily constrained by the sequence length and not by the sequences homology. For that reason, some tools, like Edlib, implement heuristics that prune the DP computations at the expense of potentially missing the optimal alignment (note the reduction in the execution time when aligning sequences of 1000 nucleotides with e>=5%).

While it is true that Edlib supports banded alignment, at its core it is an exact algorithm. Your experimental setup does not mention the explicit use of banding in Edlib, so assuming you use it with default parameters, it should return an exact result.
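
(Concretely, by "default parameters" I mean an invocation like the one below; this is my reading of Edlib's public C API, so please correct me if your setup differs. Passing $k = -1$ disables any limit on the edit distance, so no band is applied and the reported distance is exact.)

```cpp
#include <cstdio>
#include <cstring>
#include "edlib.h"   // https://github.com/Martinsos/edlib

int main() {
    const char* query  = "GATTACA";
    const char* target = "GACTATA";

    // k = -1: no limit on the edit distance, i.e. no banding, exact result.
    // EDLIB_MODE_NW: global (end-to-end) alignment; EDLIB_TASK_DISTANCE: score only.
    EdlibAlignResult res = edlibAlign(query, (int)std::strlen(query),
                                      target, (int)std::strlen(target),
                                      edlibNewAlignConfig(-1, EDLIB_MODE_NW,
                                                          EDLIB_TASK_DISTANCE, NULL, 0));
    if (res.status == EDLIB_STATUS_OK)
        std::printf("edit distance = %d\n", res.editDistance);   // 2 for this pair
    edlibFreeAlignResult(res);
    return 0;
}
```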

Thus, the much slower (2.5x) datapoint at length 1000, $e = 2\%$, is quite surprising to me. In case you do indeed use some form of banding, could you explain how this could lead to faster runtimes for higher error rates, and why this effect does not happen for shorter sequences? Are you sure this is not some kind of measurement error?

quim0 commented 1 year ago

Thanks for the feedback and sorry for the late reply. You are referring to the paper "Accelerating Edit-Distance Sequence Alignment on GPU Using the Wavefront Algorithm", which has the code here: https://github.com/quim0/eWFA-GPU

Sorry for the confusing naming. Could you open the issue on that repo?