Comments regarding speedup to `O(n+m+s^2)`

jeizenga / wfalm

Refinements of the WFA alignment algorithm with better complexity

MIT License

26 stars 0 forks source link

Comments regarding speedup to `O(n+m+s^2)` #5

Open RagnarGrootKoerkamp opened 2 years ago

RagnarGrootKoerkamp commented 2 years ago

The section on the speedup to O(n+m+s^2) should have a remark on whether you are assuming a finite or infinite alphabet. Implicitly in the context of pairwise alignment, I would assume a finite alphabet.

However, The Myers'86 paper already writes that building the suffix tree takes linear time for finite alphabets, giving the required O(N+M+D^2) in their terms. They don't explicitly mention this (I think), because they are supposedly doing LCS in the context of diff algorithms, where lines of code actually give an infinite alphabet.

Similarly, Landau-Vishkin'89 writes that the complexity is O(nk) for finding all matches with up to k mistakes of a pattern in a text for finite alphabets, reduced from O(nm) by using suffix trees as well. Their paper is not about global alignment, but applying their result to Ukkonen'85s global alignment algorithm directly gives the O(n+m+s^2) runtime.

jeizenga commented 2 years ago

The log factor in Myers' O(ND) paper actually comes from the least common ancestor data structure rather than the suffix tree. However, since publishing the preprint, I had discovered that the Landau-Vishkin algorithm uses a similar approach. Considering that it didn't provide a practical speed up, I think I will probably remove that aspect from the manuscript.

RagnarGrootKoerkamp commented 2 years ago

The log factor in Myers' O(ND) paper actually comes from the least common ancestor data structure rather than the suffix tree.

They write:

Second, a recent RAM-based algorithm for answering Q on-line queris of the LCA of vertices in a fixed V-vertex tree takes O(V+Q) time [6].

and they basically repeat this later, so my understanding is still that the (N+M) log(N+M)+D^2 logarithm is only needed for the suffix tree construction for infinite alphabets.

Considering that it didn't provide a practical speed up, I think I will probably remove that aspect from the manuscript.

Right makes sense. Nethertheless, do you have an idea how large n or s need to be before this becomes worthwhile?

jeizenga commented 2 years ago

Oh, nevermind then, I must be misremembering.

My expectation is that the tradeoff has less to do with N or S than with the properties of the sequences. Because the constants are so much higher on the suffix tree algorithm, there would only be a substantial speed-up if there are many long matches that can be reduced to O(1) queries. In high entropy sequences, the lengths of spurious matches are very short (expected length ~4/3 in uniform random sequences). To have many long matches, the sequences need to have a high rate of internal repetition.