kvark / dark-archon

[old] Fast BWT-based compressor
3 stars 1 forks source link

Where is the correct suffix array? #20

Closed stasoid closed 3 years ago

stasoid commented 3 years ago

Hello, I am trying to use archon4 for computing suffix array, but after calling compute() variable p does not hold correct suffix array. For example, for input "abracadabra" suffix array should be {10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2} or {11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3} with 1-based indexes, but p == {1, 6, 8, 4, 11, 2, 9, 5, 7, 3, 10}. How can I obtain the correct suffix array?

Also, the same question for archon7, which has ar.P == {6, 8, 11, 4, 1, 9, 2, 5, 7, 10, 3}.

kvark commented 3 years ago

Wow, hello!

The main application for the constructed suffix array (back when I was working on it) was BWT. For its purposes, the rules of SA construction could be slightly bent in order to ease the construction algorithm.

One of such changes is assuming that there is a terminator after the input string, so instead of working on "abracadabra" the algorithm actually works on "abracadabra$", where $ - the terminator - could be either lower than all symbols or higher, based on what was more convenient.

Another trick that Archon4 (at least) did was considering the source to be reversed, so that we could do full dword comparisons on the substrings. There is a REVERSED define to control this.

If you combine these two, then the actual source being processed was: "$abracadabra", where $ is the lowest symbol, but the sub-strings going right to left instead of left-to-right. The lowest substring is a$, then adacarba$, then acarba$, etc, which makes it "{1, 6, 8, 4, 11, 2, 9, 5, 7, 3, 10}" for Archon4.

The only difference with Archon7 is that this one considered "$" to be higher than all other symbols. So suffix "1" became the last of the "a..." substrings, and so forth.

stasoid commented 3 years ago

Thank you for the detailed answer. I thought there exists simple transformation from p to what I need, but as it turns out some nontrivial code modifications are required. So I better off using libdivsufsort then.