mitiko closed this issue 3 years ago
The only solution I'm seeing is to stop pretending that the dictionary is a mix of words and characters, and parse everything as words. This might be easier to implement now that we've abstracted the BWDIndex as its own class.
With the newest commit on the draft PR this shouuuuuld be fixed?
Well, it's kinda too slow to test it fully but shouldn't be an issue now.
Say we have the word `"WORD"`. One location it matches is, for example: `...abcxWORDyztv...`
We have to update the counters for the order1 and order0 models. The order0 model is simple: for each character in the word, lower the count by how many times the word occurs, then add a new symbol for this word with some initial count.

Then we have 3 different types of updates:

1) In the word:

```
W -> O
O -> R
R -> D
```
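The order0 update and the in-word order1 updates might be sketched like this (Python for illustration only; the function names, the `Counter` tables, and the `initial_count` choice are assumptions, not the actual BWD implementation):

```python
from collections import Counter

def apply_word_order0(order0, word, occurrences, initial_count=1):
    # Each character loses `occurrences` counts, since those positions
    # are now covered by the new word symbol.
    for ch in word:
        order0[ch] -= occurrences
    # The new word symbol gets some initial count (assumed here).
    order0[word] = initial_count

def apply_word_order1_inside(order1, word, occurrences):
    # In-word transitions like W -> O, O -> R, R -> D are lowered,
    # since the word is now parsed as a single symbol.
    for a, b in zip(word, word[1:]):
        order1[(a, b)] -= occurrences

order0 = Counter({"W": 5, "O": 7, "R": 4, "D": 3})
order1 = Counter({("W", "O"): 2, ("O", "R"): 2, ("R", "D"): 2})
apply_word_order0(order0, "WORD", occurrences=2)
apply_word_order1_inside(order1, "WORD", occurrences=2)
```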
Those are also not problematic.

2) Before the word: `x -> W` should be changed to `x -> <w>`, where `<w>` is the word token, the new symbol. This is done in 2 steps: lowering the count of `x -> W` and raising the count of `x -> <w>`.
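The 2-step update before the word might look something like this (a hypothetical helper, not code from the repo):

```python
from collections import Counter

def update_before_word(order1, prev_symbol, first_char, word_token, occurrences):
    # Step 1: lower the count of x -> W.
    order1[(prev_symbol, first_char)] -= occurrences
    # Step 2: raise the count of x -> <w>.
    order1[(prev_symbol, word_token)] += occurrences

order1 = Counter({("x", "W"): 3})
update_before_word(order1, "x", "W", "<w>", occurrences=3)
```

The after-the-word case is symmetric: lower `D -> y` and raise `<w> -> y`.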
3) After the word: just like before the word, but we change the context now. `D -> y` should be changed to `<w> -> y`, again done in 2 steps.

Updates 2) and 3) are problematic because we assume the symbol before or after the word is a character and not a word. For example, we can have the word `"_a"` and the word `"nd"`. Whenever `_and` appears, we have to change the model to `"_a" -> "nd"`, not `"_a" -> "n"`. This means we have to keep track of the parsed stream as we go.
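One way to keep track of the parsed stream is to apply each word replacement over a list of already-parsed symbols, so that the neighbors looked up for updates 2) and 3) can themselves be word tokens. A hedged sketch (the function name, the token spellings like `<_a>`, and the one-pass scan are all invented for illustration; in-word count updates are omitted here):

```python
from collections import Counter

def parse_and_update(symbols, word, token, order1):
    # `symbols` is the already-parsed stream: a mix of characters
    # and previously substituted word tokens.
    chars = list(word)
    out = []
    i = 0
    while i < len(symbols):
        if symbols[i:i + len(chars)] == chars:
            # Look up the *parsed* neighbors, which may be word tokens.
            prev_sym = out[-1] if out else None
            next_sym = symbols[i + len(chars)] if i + len(chars) < len(symbols) else None
            if prev_sym is not None:
                order1[(prev_sym, chars[0])] -= 1   # lower e.g. x -> W
                order1[(prev_sym, token)] += 1      # raise e.g. x -> <w>
            if next_sym is not None:
                order1[(chars[-1], next_sym)] -= 1  # lower e.g. D -> y
                order1[(token, next_sym)] += 1      # raise e.g. <w> -> y
            out.append(token)
            i += len(chars)
        else:
            out.append(symbols[i])
            i += 1
    return out

order1 = Counter()
stream = parse_and_update(list("x_andy"), "_a", "<_a>", order1)
stream = parse_and_update(stream, "nd", "<nd>", order1)
# After parsing "_a", the stream contains the token <_a>, so parsing
# "nd" records the word-word transition <_a> -> <nd>, not "_a" -> "n".
```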