Open mitiko opened 3 years ago
There's also a different kind of parsing done when counting the words, which exists to handle overlapping words. Overlaps are unlikely for most English words, but with genetic alphabets words overlap a lot. We should make that path faster, and also add a check for whether overlap-aware parsing is needed at all, so that English text can skip it.
Maybe some KMP-style preprocessing of each word would work, though that could be slow in itself.
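A minimal sketch of what that check could look like, in Python for illustration only (not the project's code). It assumes "overlap" means a word overlapping with another occurrence of itself, which is only possible when the word has a non-empty border (a proper prefix that is also a suffix); the KMP prefix function finds that in linear time per word. Overlaps between two different words are not covered here. The names `prefix_function`, `can_self_overlap` and `count_occurrences` are hypothetical.

```python
def prefix_function(word):
    """KMP prefix function: pi[i] is the length of the longest proper
    prefix of word[:i + 1] that is also a suffix of it."""
    pi = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k > 0 and word[i] != word[k]:
            k = pi[k - 1]
        if word[i] == word[k]:
            k += 1
        pi[i] = k
    return pi


def can_self_overlap(word):
    """True if two occurrences of `word` can overlap, i.e. the word
    has a non-empty border."""
    return len(word) > 1 and prefix_function(word)[-1] > 0


def count_occurrences(text, word):
    """Count all occurrences of `word` in `text`, overlapping ones included."""
    if not word:
        return 0
    if not can_self_overlap(word):
        # Fast path: occurrences cannot overlap, so the plain
        # non-overlapping count is already exact.
        return text.count(word)
    # Slow path: advance one character at a time so overlaps are seen.
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count
```

Under these assumptions a word like `the` takes the fast path, while `ACA` or `AA` falls through to the overlap-aware scan. Whether the preprocessing pays off would depend on how many candidate words there are and how long they are.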
Moved the other issue to #30
Parsing used to be done in two different ways: once during dictionary calculation and again afterwards when creating the sequence of word indices. Parsing is now the same in both places, to avoid mistakes. Following the DRY (don't repeat yourself) principle, this parsing code should be extracted into its own method.
Doing so would also pay off later: if we switch to a different ranking method, a more efficient parsing method might become available, and it would only need to be changed in one place.
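A rough sketch of the shape this refactoring could take, in Python for illustration only. Everything here is an assumption: the greedy first-match strategy, the `LITERAL` marker, and the names `parse`, `count_words` and `to_indices` are hypothetical and not taken from the project. The point is only that counting and index generation both consume one shared parser.

```python
from typing import Iterator, List, Sequence, Tuple

LITERAL = -1  # hypothetical marker for a character not covered by any word


def parse(text: str, dictionary: Sequence[str]) -> Iterator[Tuple[int, int]]:
    """Yield (position, word_index) pairs. Assumes the dictionary is
    ordered by rank and the first matching word wins."""
    pos = 0
    while pos < len(text):
        for index, word in enumerate(dictionary):
            if word and text.startswith(word, pos):
                yield pos, index
                pos += len(word)
                break
        else:
            # No word matches here: emit a literal and move one character on.
            yield pos, LITERAL
            pos += 1


def count_words(text: str, dictionary: Sequence[str]) -> List[int]:
    """Count word usage with the shared parser."""
    counts = [0] * len(dictionary)
    for _, index in parse(text, dictionary):
        if index != LITERAL:
            counts[index] += 1
    return counts


def to_indices(text: str, dictionary: Sequence[str]) -> List[int]:
    """Build the sequence of word indices with the same parser."""
    return [index for _, index in parse(text, dictionary)]
```

With this layout, swapping in a different parsing strategy means replacing `parse` alone, and the counts and the index sequence cannot drift apart.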
TODO: I still need to read some of the papers on parsing done for dynamic coders like LZ77 and LZ78