Order1 entropy ranking model isn't updated correctly for consecutive words

mitiko commented 3 years ago

Say we have the word "WORD". One location it matches is for example: ...abcxWORDyztv... We have to update the counters for order1 and order0 models. The order0 model is simple - for each character in the word, lower the count by how many times the word occurs, then add a new symbol for this word with some initial count.

Then we have 3 different types of updates: 1) In the word: W -> O O -> R R -> D Those are also not problematic 2) Before the word: x -> W should be changed to x -> <w> where <w> is the word token, the new symbol This is done in 2 steps - lowering the count of x -> W and raising the count of x -> <w> 3) After the word: Just like before the word, but we change the context now. D -> y should be changed to <w> -> y, again done in 2 steps.

Updates 2) and 3) are problematic because we assume the symbol before or after the word is a character and not a word. For example, we can have the word "_a" and the word "nd". Whenever _and appears we have to change the model to: "_a" -> "nd" not "_a" -> "n"

This means we have to keep track of the parsed stream as we go.

mitiko commented 3 years ago

The only solution I'm seeing is to stop pretending that the dictionary is a mix of words and characters and parse everything as words. This might be easier to implement now, that we've abstracted the BWDIndex as its own class.

mitiko commented 3 years ago

With the newest commit on the draft PR this shouuuuuld be fixed?

mitiko commented 3 years ago

Well, it's kinda too slow to test it fully but shouldn't be an issue now.

mitiko / BWDPerf

Order1 entropy ranking model isn't updated correctly for consecutive words #32