rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Use langcodes when tokenizing again #49

Closed rspeer closed 7 years ago

rspeer commented 7 years ago

We previously removed the call to langcodes that normalizes language codes when tokenizing, because normalizing a language code required connecting to a SQLite database, which was too heavyweight and not thread-safe enough for an operation as simple as tokenization.

As a result, if you used an overlong language code, such as chi instead of zh for Chinese (and I have seen data sources that use this code), the wrong tokenization would happen: you would get inconsistent tokens and therefore inconsistent word frequencies.

langcodes has been redesigned, and normalizing a language code is now an easy operation, so we can put this back.
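
To illustrate the failure mode, here is a minimal sketch, not wordfreq's actual code. The alias table, `normalize_code`, and `pick_tokenizer` are all hypothetical stand-ins; in practice the normalization would come from langcodes rather than a hand-written dictionary.

```python
# Hypothetical alias table: a tiny stand-in for the normalization
# that langcodes would provide. Only a few codes are listed.
ALIASES = {"chi": "zh", "zho": "zh", "jpn": "ja", "eng": "en"}


def normalize_code(code):
    """Reduce a language code to a short primary-language code (sketch)."""
    # Lowercase, treat "_" like "-", and keep only the primary subtag.
    primary = code.lower().replace("_", "-").split("-")[0]
    return ALIASES.get(primary, primary)


def pick_tokenizer(lang, normalize=True):
    """Return the name of the tokenizer that would be used for `lang`.

    The tokenizer names are placeholders, not real wordfreq functions.
    """
    key = normalize_code(lang) if normalize else lang
    if key == "zh":
        return "chinese_segmenter"
    if key == "ja":
        return "japanese_segmenter"
    return "default_regex"


# Without normalization, "chi" falls through to the default tokenizer,
# so the same text is tokenized differently than it is under "zh".
print(pick_tokenizer("chi", normalize=False))
print(pick_tokenizer("chi"))
print(pick_tokenizer("zh"))
```

With normalization in place, `chi` and `zh` select the same tokenizer, so the same text yields the same tokens and the same word frequencies under either code.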