pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

detect_multiple_languages_of predicts incorrect languages #109

Closed · jordimas · closed 1 year ago

jordimas commented 1 year ago

Using version 1.3.1

Using a text that is only in Catalan, that does not contain any fragments from other languages, and that is a very standard kind of text, the detect_multiple_languages_of method detects: CATALAN, SOMALI, LATIN, FRENCH, SPANISH and PORTUGUESE. The expectation is that it should report that the full text is CATALAN.

Code to reproduce the problem:

from lingua import Language, LanguageDetectorBuilder, IsoCode639_1

with open('text-catalan.txt') as fh:
    text = fh.read()

    detector = LanguageDetectorBuilder.from_all_languages().build()

    for result in detector.detect_multiple_languages_of(text):
        print(f"{result.language.name}")

Related to this problem, detect_language_of and detect_multiple_languages_of also predict different languages for the same text. Below is an example where, on the same input, detect_language_of predicts Catalan and detect_multiple_languages_of predicts Tsonga.

My expectation is that both methods predict the same language given the same input.

Code sample:

from lingua import Language, LanguageDetectorBuilder, IsoCode639_1

with open('china.txt') as fh:
    text = fh.read()

    detector = LanguageDetectorBuilder.from_all_languages().build()

    result = detector.detect_language_of(text)
    print(f"detect_language_of prediction: {result}")

    for result in detector.detect_multiple_languages_of(text):
        print(f"detect_language_of prediction: {result.language.name}")
jordimas commented 1 year ago

File used as input on the first example: text-catalan.txt

jordimas commented 1 year ago

File used as input on the second example: china.txt

pemistahl commented 1 year ago

As stated in the documentation, the detection of multiple languages is experimental, so please do not expect reliable results that are usable in production. Basically, the current algorithm tries to identify a language for each single word and then merges contiguous words that have the same language. I will certainly try to improve the algorithm in later releases.
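For anyone reading along, a minimal sketch of that idea (not the actual implementation; the helper name here is made up) would be something like:

from lingua import LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_all_languages().build()

def naive_word_merge_segments(text):
    # Hypothetical illustration of the approach described above: classify each
    # word in isolation, then merge runs of consecutive words that were
    # assigned the same language.
    segments = []  # list of [language, list_of_words] pairs
    for word in text.split():
        language = detector.detect_language_of(word)
        if segments and segments[-1][0] == language:
            segments[-1][1].append(word)
        else:
            segments.append([language, [word]])
    return [(language, " ".join(words)) for language, words in segments]

A single word carries very little n-gram evidence, which is presumably why isolated Catalan words occasionally flip to languages such as SOMALI or LATIN and fragment the result.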