Robust language codes

goodmami commented 3 years ago

Now that langcodes version 3.0 no longer has a dependency problem and is a bit lighter, it could be used to make lang=... filters more robust by normalizing language codes. For instance, en and eng, and maybe even en-US, would all resolve to the same code and load the relevant lexicons.

goodmami commented 3 years ago

At first I thought about inserting another column with the normalized language tag, but since queries about lexicons by language generate the list of languages anyway, something like langcodes.closest_match() could work:

>>> import langcodes
>>> langcodes.closest_match('eng', ['en', 'de'])
('en', 0)

goodmami commented 2 years ago

The motivation for this was to better accommodate the switch from ISO 639-3 alpha3 codes to BCP-47 codes, but now I think this is a bad idea:

- we should encourage users to be specific with a lexicon specifier (e.g., oewn:2021)
- the 3-letter codes used in the NLTK can be mapped with a simple dictionary in any NLTK compatibility module
- using langcodes adds a dependency
- the complexity increases the maintenance burden
- there is a potential for a language code to resolve to something unintended (e.g., if a user has two Mandarin Chinese wordnets, one in simplified (cmn-Hans) and one in traditional (cmn-Hant))
- it's easy to get the actual language codes used via wn.lexicons()
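
The dictionary-based mapping mentioned in the list above could look roughly like the following sketch. The names NLTK_TO_BCP47 and nltk_lang_to_bcp47 are hypothetical, and the table covers only a handful of ISO 639-3 codes for illustration; it is not taken from Wn or the NLTK.

```python
# Hypothetical compatibility table: three-letter ISO 639-3 codes (the style
# used by the NLTK) mapped to the equivalent two-letter language subtags.
# Only a few entries are shown; a real module would need a complete table.
NLTK_TO_BCP47 = {
    "eng": "en",
    "fra": "fr",
    "ita": "it",
    "spa": "es",
    "jpn": "ja",
}


def nltk_lang_to_bcp47(code: str) -> str:
    """Return the mapped tag for a known 3-letter code, else the input unchanged."""
    # Lookup is case-insensitive; unknown codes pass through untouched so
    # that tags such as "cmn-Hans" are never rewritten by accident.
    return NLTK_TO_BCP47.get(code.lower(), code)
```

Passing unknown codes through unchanged keeps the mapping from silently guessing, which matches the concern about codes resolving to something unintended.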

Users wanting robust language codes could use langcodes or similar on their own. Such a feature might make sense on an application making use of the Wn library, rather than in the library itself.
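
As a rough illustration of doing this at the application level, the sketch below resolves a user-supplied code against the language codes that are actually installed, and refuses to guess when more than one candidate matches. It uses only the standard library and a simple primary-subtag comparison rather than langcodes. The names ALIASES, primary_subtag, and resolve_language are hypothetical, and the installed codes are passed in explicitly (an application might collect them from wn.lexicons()).

```python
# Hypothetical alias table from three-letter codes to two-letter subtags.
ALIASES = {"eng": "en", "fra": "fr", "jpn": "ja"}


def primary_subtag(tag: str) -> str:
    """Return the normalized primary language subtag of a tag like 'en-US'."""
    primary = tag.replace("_", "-").split("-")[0].lower()
    return ALIASES.get(primary, primary)


def resolve_language(requested: str, installed: list) -> str:
    """Pick the single installed code that matches the requested language.

    Raises ValueError if nothing matches or if the match is ambiguous.
    """
    # An exact match always wins, so specific requests stay specific.
    if requested in installed:
        return requested
    wanted = primary_subtag(requested)
    matches = [code for code in installed if primary_subtag(code) == wanted]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"no installed language matches {requested!r}")
    raise ValueError(f"{requested!r} is ambiguous; candidates: {matches}")
```

With installed codes of en and de, both eng and en-US resolve to en. With cmn-Hans and cmn-Hant installed, a request for plain cmn raises an error instead of picking one of the two wordnets.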

goodmami / wn

Robust language codes #100