goodmami / wn

A modern, interlingual wordnet interface for Python
https://wn.readthedocs.io/
MIT License

Robust language codes #100

Closed (goodmami closed 2 years ago)

goodmami commented 3 years ago

Now that langcodes version 3.0 no longer has a dependency problem and is a bit lighter, it could be used to make lang=... filters more robust by normalizing language codes. For instance, en and eng, and maybe even en-US, would all resolve to the same code and would load the relevant lexicons.

goodmami commented 3 years ago

At first I thought about inserting another column with the normalized language tag, but since queries about lexicons by language generate the list of languages anyway, something like langcodes.closest_match() could work:

>>> import langcodes
>>> langcodes.closest_match('eng', ['en', 'de'])
('en', 0)

goodmami commented 2 years ago

The motivation for this was to better accommodate the switch from ISO 639-3 alpha3 codes to BCP-47 codes, but now I think this is a bad idea:

Users who want robust language codes can use langcodes or a similar package on their own. Such a feature makes more sense in an application that uses the Wn library than in the library itself.
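
For anyone landing here later, a minimal sketch of how an application could do this on its own side. It is not part of Wn. The names resolve_language, langcodes_matcher, and max_distance are made up for illustration, and the only langcodes call used is closest_match() as in the earlier comment. The matcher is a parameter so the filtering logic can be exercised without langcodes installed.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# (desired tag, available tags) -> (best tag, distance). This mirrors the
# return shape of langcodes.closest_match() shown earlier in the thread.
Matcher = Callable[[str, List[str]], Tuple[str, int]]


def langcodes_matcher(desired: str, available: List[str]) -> Tuple[str, int]:
    """Default matcher; needs the third-party langcodes package (3.x)."""
    import langcodes  # imported lazily so the sketch loads without it

    return langcodes.closest_match(desired, available)


def resolve_language(
    requested: str,
    available: Iterable[str],
    matcher: Matcher = langcodes_matcher,
    max_distance: int = 0,  # hypothetical knob: 0 accepts only exact-equivalent tags
) -> Optional[str]:
    """Map a user-supplied code onto one of the available language tags.

    Returns None when nothing acceptable is found, so the caller can decide
    whether to fail or fall back to the raw code.
    """
    tags = sorted(set(available))
    if not tags:
        return None
    best, distance = matcher(requested, tags)
    # Checking membership avoids relying on whatever placeholder tag the
    # matcher returns when there is no match.
    if best in tags and distance <= max_distance:
        return best
    return None
```

Assuming wn.lexicons() and the language attribute of its results behave as I understand them, an application could build the available set with {lex.language for lex in wn.lexicons()}, call resolve_language('eng', ...) on it, and pass the result as lang= to its Wn queries.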