mitiko / BWDPerf

BWD stands for Best Word Dictionary as it has the ability to be an optimal dictionary coder.
https://mitiko.github.io/BWDPerf
GNU General Public License v3.0

Standardize parsing #10

Open mitiko opened 3 years ago

mitiko commented 3 years ago

Parsing used to be done in two different ways: once during dictionary calculation and again afterwards, when creating the sequence of word indices. Both places now parse the same way, to avoid mistakes. Following the DRY (don't repeat yourself) principle, this parsing code should be extracted into a separate method.

This would also pay off later: if we consider a different ranking method in the future, a more efficient parsing method might become available, and it would only have to be swapped in one place.
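A minimal sketch of what a single shared parsing method could look like. Python is used for illustration only, and everything here is an assumption rather than the project's actual code: the names `parse`, `word_counts`, `encode` and `ESCAPE` are hypothetical, the dictionary is assumed to hold unique non-empty words, and greedy longest-match is only a stand-in for whatever parsing strategy BWD ends up using.

```python
from collections import Counter

ESCAPE = -1  # hypothetical marker for a symbol that no dictionary word covers


def parse(text, dictionary):
    """Greedy longest-match parse of text into dictionary indices (illustrative only)."""
    index = {word: i for i, word in enumerate(dictionary)}
    # Try longer words first so the longest match wins at each position.
    candidates = sorted(dictionary, key=len, reverse=True)
    out = []
    pos = 0
    while pos < len(text):
        for word in candidates:
            if text.startswith(word, pos):
                out.append(index[word])
                pos += len(word)
                break
        else:
            # No word matches here: emit an escape and skip one symbol.
            out.append(ESCAPE)
            pos += 1
    return out


def word_counts(text, dictionary):
    """Dictionary-calculation side: count how often each word is used by the parse."""
    return Counter(i for i in parse(text, dictionary) if i != ESCAPE)


def encode(text, dictionary):
    """Encoding side: the sequence of word indices comes from the same parse."""
    return parse(text, dictionary)
```

Because both callers go through `parse`, changing the strategy later means editing one function, and the counts used for ranking cannot drift away from what the encoder actually emits.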

TODO: I still need to read some of the papers on parsing done for dynamic coders like LZ77 and LZ78

mitiko commented 3 years ago

There's also a different kind of parsing done when counting the words. It is there to handle overlapping words. Overlaps are unlikely for most words in English, but with genetic alphabets words overlap a lot. We should make that step faster and also check whether parsing is even needed, to speed up English text.
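To illustrate why overlaps matter for counting, here is a small sketch (Python, illustrative only; both function names are hypothetical and words are assumed to be non-empty). It compares counting every occurrence with counting only occurrences that do not overlap each other:

```python
def count_overlapping(text, word):
    """Count every occurrence of word, including ones that overlap each other."""
    count = 0
    pos = text.find(word)
    while pos != -1:
        count += 1
        # Advance by one symbol only, so overlapping matches are still found.
        pos = text.find(word, pos + 1)
    return count


def count_non_overlapping(text, word):
    """Count occurrences that can all be used at once; str.count skips overlaps."""
    return text.count(word)
```

On a DNA-like string such as `"ATATATA"`, the word `"ATA"` occurs 3 times but only 2 of those occurrences can be used together. On `"the cat sat on the mat"`, the word `"the"` gives 2 either way, which is the case where the extra parsing pass is wasted work.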

mitiko commented 3 years ago

Maybe some KMP (Knuth-Morris-Pratt) preprocessing of the words would help. This could be slow, though.
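One possible reading of this idea, sketched in Python for illustration (the name `can_self_overlap` is hypothetical, and this only covers a word overlapping with itself, not overlaps between two different words): compute the KMP prefix function for a word and look at its last entry. A word can overlap another copy of itself exactly when some non-empty proper prefix is also a suffix, which is what a non-zero last entry says.

```python
def prefix_function(word):
    """KMP prefix function: pi[i] is the length of the longest proper prefix
    of word[:i + 1] that is also a suffix of it."""
    pi = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        # Fall back through shorter borders until the next symbol matches.
        while k > 0 and word[i] != word[k]:
            k = pi[k - 1]
        if word[i] == word[k]:
            k += 1
        pi[i] = k
    return pi


def can_self_overlap(word):
    """True if two occurrences of word can overlap each other in a text."""
    return len(word) > 1 and prefix_function(word)[-1] > 0
```

The table takes time linear in the word length, so the preprocessing itself is cheap per word. Words for which `can_self_overlap` is false (for example `"the"`) could skip the overlap-aware counting entirely, while words like `"ATA"` or `"AA"` would still need it.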

mitiko commented 3 years ago

Moved the other issue to #30