mitiko / BWDPerf

BWD stands for Best Word Dictionary as it has the ability to be an optimal dictionary coder.
https://mitiko.github.io/BWDPerf
GNU General Public License v3.0

Entropy change calculation is wrong for words with repeating symbols #35

Closed mitiko closed 3 years ago

mitiko commented 3 years ago

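If the character 'a' appears twice in a word, the simple loop calculates the entropy change as 2cx log(cx) - 2(cx-cw) log(cx-cw), when in fact it should be cx log(cx) - (cx-2cw) log(cx-2cw), where cx is the count of 'a' and cw is the count of the word: the two occurrences remove 2cw copies of 'a' in total, not cw copies twice over.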

Fixing this is definitely a slowdown. I suppose we can create a dictionary of per-character occurrence counts for the word. If the dictionary's count matches the word length, no character repeats and the simple loop is fine; otherwise, we add a term to the rank based on the occurrence count of each character. Or, since we always create the dictionary anyway, we can just iterate over it.
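A minimal sketch of the two approaches, not the project's actual code. It assumes each symbol contributes a term c * log2(c) to the rank (the base of the logarithm is an assumption) and that replacing a word which contains a character k times removes k * cw occurrences of that character. The function names `xlogx`, `naive_change` and `corrected_change` are hypothetical, chosen for illustration only.

```python
import math
from collections import Counter


def xlogx(c: int) -> float:
    """Return c * log2(c), using the convention 0 * log(0) = 0."""
    return c * math.log2(c) if c > 0 else 0.0


def naive_change(word: str, cw: int, counts: dict) -> float:
    """Simple loop over positions: each position is handled on its own,
    as if it alone removed cw copies of its character."""
    return sum(xlogx(counts[ch]) - xlogx(counts[ch] - cw) for ch in word)


def corrected_change(word: str, cw: int, counts: dict) -> float:
    """Loop over the per-character dictionary: a character that appears
    k times in the word loses k * cw occurrences in a single step."""
    return sum(
        xlogx(counts[ch]) - xlogx(counts[ch] - k * cw)
        for ch, k in Counter(word).items()
    )


# Hypothetical symbol counts for the whole text, and a word count of 2.
counts = {"a": 10, "b": 6}
print(naive_change("ab", 2, counts), corrected_change("ab", 2, counts))
print(naive_change("aba", 2, counts), corrected_change("aba", 2, counts))
```

For "ab" the two functions agree, because no character repeats. For "aba" they differ, because the naive loop processes 'a' twice against the unchanged count. Iterating over the dictionary handles both cases with one code path, which is the "work over that" option described above.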