pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0
1.02k stars, 43 forks

Convert language to ISO 639-1 language code #198

Closed. devWhyqueue closed this issue 7 months ago

devWhyqueue commented 7 months ago

Lingua-Py is a valuable tool for language detection, but for many applications, it is essential to have language information represented in a standardized format. ISO 639-1 language codes are widely recognized and used across various industries and applications. By adding a feature to convert detected languages to their ISO 639-1 language codes, Lingua-Py can become even more versatile and user-friendly.

Usage suggestion:

text = "Bonjour tout le monde"
lang_detect = LanguageDetectorBuilder.from_all_languages().build()
lang = lang_detect.detect_language_of(text)
lang_iso = lang.iso_639_1

print(f"Detected Language: {lang}")
print(f"ISO 639-1 Language Code: {lang_iso}")

pemistahl commented 7 months ago

Hi @devWhyqueue, what you want is already possible. Each Language has the attributes iso_code_639_1 and iso_code_639_3.

>>> from lingua import Language
>>> Language.ENGLISH.iso_code_639_1
IsoCode639_1.EN
>>> Language.ENGLISH.iso_code_639_1.name
'EN'

I will check the documentation and improve it if this info is missing.
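To connect this to the usage suggestion from the original request, here is a minimal sketch of a helper that turns a detection result into a lowercase code such as "en". It relies only on the iso_code_639_1 attribute and its name shown above. The helper names iso_639_1_code and detect_iso_code are hypothetical and not part of the library, and lowercasing the name is an assumption about the desired output format, since the name itself is uppercase ('EN').

```python
def iso_639_1_code(language):
    """Return the lowercase ISO 639-1 code of a detected language.

    `language` is expected to expose `iso_code_639_1.name`, as
    lingua's Language does in the session above. Returns None when
    no language was detected.
    """
    if language is None:
        return None
    # The enum member name is uppercase ('EN'); lowercasing it is an
    # assumption about the format the caller wants ('en').
    return language.iso_code_639_1.name.lower()


def detect_iso_code(detector, text):
    """Detect the language of `text` and return its ISO 639-1 code.

    `detector` is any object with a `detect_language_of(text)` method,
    for example one built with LanguageDetectorBuilder.
    """
    return iso_639_1_code(detector.detect_language_of(text))


# Intended usage, assuming lingua-language-detector is installed:
#
#   from lingua import LanguageDetectorBuilder
#   detector = LanguageDetectorBuilder.from_all_languages().build()
#   detect_iso_code(detector, "Bonjour tout le monde")
```

Passing the detector in as an argument keeps the helper independent of how the detector was built, and the None check covers the case where no language is returned.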

devWhyqueue commented 7 months ago

Unfortunately, this is not working for me.

The following code

from lingua_py import Language

def test_language_has_iso_code():
    assert Language.English.iso_code_639_1.name == "en"

raises an AttributeError:

test_nlp.py:5 (test_language_has_iso_code)
def test_language_has_iso_code():
>       assert Language.English.iso_code_639_1.name == "en"
E       AttributeError: 'builtins.Language' object has no attribute 'iso_code_639_1'

test_nlp.py:7: AttributeError

devWhyqueue commented 7 months ago

Sorry for wasting your time. There are too many lingua packages on PyPI, and I obviously installed the wrong one. Maybe consider moving the PyPI package name, lingua-language-detector, to a more prominent position, though. As the repo's name is lingua-py, I installed the package with that name.

pemistahl commented 7 months ago

Back then, I would have preferred to name the PyPI package just lingua, but that name was already taken. The packages lingua-py and lingua-py-unofficial are third-party Python bindings to the Rust implementation of my library. I'm not the author of these. They were created before I eventually offered official Python bindings myself.

Nevertheless, I cannot spare you from reading my documentation, which clearly states the correct package name, lingua-language-detector. ;-)
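For anyone who runs into the same mix-up, here is a minimal sketch of one way to check which of the similarly named distributions are installed in the current environment. It uses only the standard library's importlib.metadata. The distribution names are the ones mentioned in this thread, and the function names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from this thread.
OFFICIAL = "lingua-language-detector"
LOOKALIKES = ("lingua", "lingua-py", "lingua-py-unofficial")


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def report_lingua_installs():
    """Map each lingua-related distribution name to its version or None."""
    return {name: installed_version(name) for name in (OFFICIAL, *LOOKALIKES)}


if __name__ == "__main__":
    for name, found in report_lingua_installs().items():
        print(f"{name}: {found if found else 'not installed'}")
```

If the official distribution shows as not installed while one of the others is present, that would explain an AttributeError like the one reported above.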