rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Inconsistent language-code strings lead to inconsistent normalization #36

Closed by rspeer 7 years ago

rspeer commented 8 years ago

This has apparently been the case for a while, but we should fix it in an update:

The tokenize function assumes it's getting a nicely-normalized language code. But when looking up word frequencies, we don't actually normalize the language code until later, and we do it inside get_frequency_list without returning it.

I can think of an ugly fix we could make right away, or a nice fix that would require a change to langcodes to make simple cases of language matching faster.
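
A minimal sketch of the mismatch described above, using toy stand-ins rather than wordfreq's actual code. The names `normalize_lang`, `FREQS`, `word_frequency_buggy`, and `word_frequency_fixed` are hypothetical, and the `tokenize` and `get_frequency_list` here are simplified imitations that only share their names with the real functions. The "fixed" variant shows one possible approach (normalize once, pass the result to both steps) and is not necessarily what was eventually merged.

```python
# Toy frequency data keyed by *normalized* language code (hypothetical values).
FREQS = {"en": {"hello": 1e-4}, "tr": {"ısı": 1e-5}}


def normalize_lang(code):
    """Toy normalization: unify separators, lowercase, keep the primary subtag."""
    return code.replace("_", "-").lower().split("-")[0]


def tokenize(text, lang):
    """Assumes `lang` is already normalized, as the issue describes."""
    if lang == "tr":
        # Turkish-specific case handling: dotted and dotless i.
        text = text.replace("İ", "i").replace("I", "ı")
    return text.lower().split()


def get_frequency_list(lang):
    """Normalizes the code internally, but does not hand it back to the caller."""
    return FREQS[normalize_lang(lang)]


def word_frequency_buggy(word, lang):
    # tokenize sees the raw code; only the lookup sees the normalized one.
    tokens = tokenize(word, lang)
    freqs = get_frequency_list(lang)
    return freqs.get(" ".join(tokens), 0.0)


def word_frequency_fixed(word, lang):
    # Normalize once up front so both steps agree on the language.
    lang = normalize_lang(lang)
    tokens = tokenize(word, lang)
    freqs = get_frequency_list(lang)
    return freqs.get(" ".join(tokens), 0.0)
```

In this sketch, passing `"tr-TR"` to the buggy variant skips the Turkish-specific step in `tokenize`, so the word is normalized one way and looked up in a list built another way.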

rspeer commented 7 years ago

This was fixed by #49.