We previously removed the call to langcodes that normalizes language codes when tokenizing, because normalizing a language code required connecting to a SQLite database, which was too heavyweight and not thread-safe enough for an operation as simple as tokenization.
This meant that if you used an overlong language code, such as chi for Chinese (and I have seen data sources that use this code), the wrong tokenization would be applied, giving you inconsistent tokens and therefore inconsistent word frequencies.
langcodes has been redesigned, and normalizing a language code is now an easy operation, so we can put this back.
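As a minimal sketch of what this normalization accomplishes (the exact call site in the tokenizer may differ), langcodes can map an overlong or bibliographic code like chi onto the standard BCP 47 tag before the tokenizer is chosen:

```python
import langcodes

# 'chi' is a bibliographic ISO 639-2 code for Chinese; the preferred
# BCP 47 tag is 'zh'. Normalizing the code before picking a tokenizer
# keeps tokenization, and therefore word frequencies, consistent.
assert langcodes.standardize_tag("chi") == "zh"
assert langcodes.standardize_tag("eng") == "en"
```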