mitiko / BWDPerf

BWD stands for Best Word Dictionary as it has the ability to be an optimal dictionary coder.
https://mitiko.github.io/BWDPerf
GNU General Public License v3.0
0 stars 1 forks source link

Parsing words faster #30

Closed mitiko closed 3 years ago

mitiko commented 3 years ago

Parsing the locations of words is a major task - we need to sort the array of locations (and we do that for all matches). For text, we don't need a sorted array of locations for most words as they can't interfere with themselves. The question is "is it faster to pre-process every word to check if it can interfere with itself vs sorting locations and parsing" The answer is probably "maybe". There is a line roughly dependant on locations' count after which it becomes more efficient to do the check. That's going to be a minor slow-down, I'll allow it for text but we need a text switch for biodata where overlaps are common.

Is there a faster approach to parsing? The best I can think of is some faster data access/matching by reordering the bitvector to align with the suffix array so we can skip a lot of words fast. Then we take only the available locations, sort them and do the interference checking.

mitiko commented 3 years ago

I can see how the algorithm would work: We start with the last character in the string. For some suffix of the word to be a prefix, the last character must be repeated somewhere. We find the first time it is repeated (starting from the beginning). If it doesn't repeat, no need to sort arrays. If it does repeat, we check the previous character. If it matches with the second to last character or is the beginning of a word we should probably do the parsing.

It's not going to be foolproof - we still will sort the array of locations for some words that don't interfere with themselves, but in exchange, this check will be faster for all words. And most words from a text are excluded this way.

mitiko commented 3 years ago

I tested it, but the time saved by not sorting is compensated by the time to check if parsing is needed. Technically that can be cached but, there's really no need at this point.