I'm proposing to introduce an IBWDRanking interface for ranking.
The idea is that this new object (implementing the interface) will keep its own state - top-ranked words, covered characters, and others - while BWD focuses on searching for better dictionaries with new ideas like self-exclusive words and future trees.
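Roughly what I have in mind - a minimal sketch, with illustrative names and signatures that aren't final:

```csharp
using System.Collections.Generic;

// Hypothetical shape of a ranked word; the real fields are TBD.
public readonly record struct RankedWord(int Location, int Length, double Rank);

public interface IBWDRanking
{
    // Called for each candidate word with its current occurrence count;
    // the implementation updates its own state (top-ranked words,
    // covered characters, ...).
    void Rank(int location, int length, int count);

    // BWD only ever consumes the first x top-ranked words.
    IReadOnlyList<RankedWord> GetTopRankedWords(int count);

    // Clear per-pass state before a recount.
    void Reset();
}
```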
I rewrote the whole counting, ranking, and splitting to use the suffix array, but I'm hitting a significant slowdown instead. Maybe taking the LCP will give us faster times, since we wouldn't have to match words on each recount?
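For context, the counting query on a suffix array boils down to two binary searches over the sorted suffixes - the block of suffixes that start with the word gives the count. A sketch of that query (not the actual BWD code):

```csharp
static class SuffixArraySearch
{
    // Counts occurrences of word in text using a precomputed suffix array sa.
    // Two binary searches locate the contiguous block of suffixes that start
    // with word; the block length is the occurrence count. O(|word| log n).
    public static int Count(byte[] text, int[] sa, byte[] word)
        => UpperBound(text, sa, word) - LowerBound(text, sa, word);

    // < 0 if the suffix sorts before word, 0 if word is a prefix of it.
    static int Compare(byte[] text, int suffix, byte[] word)
    {
        for (int i = 0; i < word.Length; i++)
        {
            if (suffix + i >= text.Length) return -1; // suffix ran out first
            int d = text[suffix + i] - word[i];
            if (d != 0) return d;
        }
        return 0;
    }

    static int LowerBound(byte[] text, int[] sa, byte[] word)
    {
        int lo = 0, hi = sa.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (Compare(text, sa[mid], word) < 0) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    static int UpperBound(byte[] text, int[] sa, byte[] word)
    {
        int lo = 0, hi = sa.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (Compare(text, sa[mid], word) <= 0) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

Doing this per word on every recount is what adds up - hence the idea that the LCP could cut out the re-matching.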
Okay, so re-counting is quite difficult... I hated checking all the indexes, and it still doesn't pass all tests. Right now, there are some exceptions thrown on the first/full count for the tests with multiple dictionaries. I'm guessing my matching code is wrong somewhere. Thus, I'm moving this to the dev branch, as it's not trivial.
I'm guessing the LCP will be quite helpful; it's supposed to speed up SA search as well (which we need for the re-count). It should also simplify the matching code on the initial count and shouldn't take that much space. If needed, we can limit maxWordSize to 256 and store the LCP as a byte[] instead of an int[]. My only concern is how it would interact with my optimization to the suffix array construction (limiting to suffixes of length ceil(log2(maxWordLength))).
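For reference, this is the textbook Kasai construction (not our code), which is linear but assumes every suffix is present in the SA; with maxWordSize capped at 256 the result could be stored as a byte[] instead:

```csharp
static class Lcp
{
    // Kasai's algorithm: builds the LCP array from the text and its suffix
    // array in O(n). lcp[i] is the length of the longest common prefix of
    // the suffixes starting at sa[i] and sa[i + 1].
    public static int[] Build(byte[] text, int[] sa)
    {
        int n = text.Length;
        if (n == 0) return System.Array.Empty<int>();

        var rank = new int[n];
        for (int i = 0; i < n; i++) rank[sa[i]] = i;

        var lcp = new int[n - 1];
        int k = 0;
        for (int i = 0; i < n; i++)
        {
            if (rank[i] == n - 1) { k = 0; continue; }
            int j = sa[rank[i] + 1]; // suffix that follows suffix i in sorted order
            while (i + k < n && j + k < n && text[i + k] == text[j + k]) k++;
            lcp[rank[i]] = k;
            if (k > 0) k--; // the LCP drops by at most 1 when moving to i + 1
        }
        return lcp;
    }
}
```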
Well, yeah, the speed-up from limiting suffix length in the SA does break the open-source LCP construction algorithms. I'm removing that optimization for now; it's not the SA construction that slows us down anyway.
I noticed a while ago that there are an awful lot of words that only occur once. Keeping them around is super inefficient, and we're never going to rank these words higher than 0, so we won't ever use them. So I deleted them. Recounting now works, and it integrates great with the ranking.
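With the LCP array around, spotting those singletons is cheap - a word occurs at least twice exactly when it's a shared prefix with a neighboring suffix. A sketch of the test I mean, assuming the lcp[] layout from the Kasai snippet above:

```csharp
static class SingletonFilter
{
    // The word of length len starting at sa[i] occurs at least twice iff its
    // LCP with the previous or the next suffix in sorted order is >= len.
    // Here lcp[i] = LCP(sa[i], sa[i + 1]), so lcp has length n - 1.
    public static bool OccursMoreThanOnce(int[] lcp, int i, int len)
    {
        bool sharedWithPrev = i > 0 && lcp[i - 1] >= len;
        bool sharedWithNext = i < lcp.Length && lcp[i] >= len;
        return sharedWithPrev || sharedWithNext;
    }
}
```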
All the matching just to count the words and keep that count up to date seems dubious.
For all ranking-based optimization techniques, we only really need the first x top-ranked words.
With the suffix array (where words are already sorted) and a bitvector of occurrence, we can count words very fast. This way we're actually coupling the ranking, the counting, and the matching, which is basically where the perf win will come from.
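One possible shape for that coupled recount - assuming the bitvector marks characters already covered by previously chosen words (a sketch under that assumption, not the final design):

```csharp
using System.Collections;

static class CoupledCount
{
    // Recounts a word given its suffix-array range [lo, hi), found once via
    // binary search. Every suffix in the range starts with the word, so no
    // byte matching is needed - just a scan against the coverage bitvector.
    public static int Recount(int[] sa, BitArray covered, int lo, int hi, int wordLen)
    {
        int count = 0;
        for (int i = lo; i < hi; i++)
        {
            // An occurrence still counts only if none of its characters
            // were covered by an already-chosen dictionary word.
            bool free = true;
            for (int j = 0; j < wordLen; j++)
                if (covered[sa[i] + j]) { free = false; break; }
            if (free) count++;
        }
        return count;
    }
}
```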