This has apparently been the case for a while, but we should fix it in an update:
The `tokenize` function assumes it's getting a nicely normalized language code. But when looking up word frequencies, we don't actually normalize the language code until later, and we do it inside `get_frequency_list` without returning it.

I can think of an ugly fix we could make right away, or a nice fix that would require a change to `langcodes` to make simple cases of language matching faster.
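For the quick-and-ugly fix, one option is roughly the following: normalize the language code once, at the top of the lookup, so that `tokenize` and the frequency-list lookup both see the same code. This is only a sketch; it assumes `langcodes.standardize_tag` is the normalization we want, and the `*_sketch` functions are hypothetical stand-ins for the internals here, not the real API.

```python
import langcodes

def normalize_lang(lang):
    """Normalize a sloppy language tag to a canonical BCP 47 tag, once, up front."""
    return langcodes.standardize_tag(lang)  # e.g. 'eng' -> 'en', 'pt-br' -> 'pt-BR'

def word_frequency_sketch(word, lang):
    # Hypothetical stand-in for the lookup path: normalize *before* tokenizing,
    # instead of normalizing inside get_frequency_list and discarding the result.
    lang = normalize_lang(lang)
    tokens = tokenize_sketch(word, lang)   # tokenize sees the normalized code
    freqs = frequency_list_sketch(lang)    # ...and so does the list lookup
    # Placeholder combination of per-token frequencies; not the real formula.
    return min((freqs.get(t, 0.0) for t in tokens), default=0.0)

def tokenize_sketch(text, lang):
    # Stand-in for tokenize(); it can now safely assume `lang` is normalized.
    return text.lower().split()

def frequency_list_sketch(lang):
    # Stand-in for get_frequency_list(); keyed only by normalized codes.
    return {'en': {'hello': 7.2e-3, 'world': 3.4e-3}}.get(lang, {})

print(word_frequency_sketch('hello world', 'eng'))  # 'eng' normalizes to 'en'
```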