pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Randomness in language detection results #231

Open elnikkis opened 1 month ago

elnikkis commented 1 month ago

I'm encountering an issue with the randomness of results when using this library for language detection. The following code produces different detection results each time it's executed.

Code:

from lingua import Language, LanguageDetectorBuilder
detector = LanguageDetectorBuilder.from_all_languages().build()

text = '考え過ぎ(´・ω・`)'  # This is Japanese text

for i in range(10):
    print(f'Iter {i}')
    print(detector.detect_language_of(text))
    confidence_values = detector.compute_language_confidence_values(text)
    for confidence in confidence_values:
        if confidence.value > 0.0:
            print(confidence.language.name, confidence.value)

Output:

Iter 0
Language.CHINESE
CHINESE 0.6189690345603295
JAPANESE 0.38103096543967047
Iter 1
Language.JAPANESE
JAPANESE 1.0
Iter 2
Language.JAPANESE
JAPANESE 1.0
Iter 3
Language.CHINESE
JAPANESE 1.0
Iter 4
Language.JAPANESE
JAPANESE 1.0
Iter 5
Language.CHINESE
CHINESE 0.6189690345603293
JAPANESE 0.3810309654396707
Iter 6
Language.CHINESE
CHINESE 0.6189690345603293
JAPANESE 0.3810309654396707
Iter 7
Language.JAPANESE
CHINESE 0.6189690345603293
JAPANESE 0.3810309654396707
Iter 8
Language.JAPANESE
CHINESE 0.6189690345603293
JAPANESE 0.3810309654396707
Iter 9
Language.CHINESE
CHINESE 0.6189690345603293
JAPANESE 0.3810309654396707

This variability in results is problematic for reproducibility. Is there a way to fix or stabilize the output so that it remains consistent across multiple runs?

Version information:

pemistahl commented 1 month ago

This should definitely not happen. Thanks for letting me know. Does this happen with other languages, too? I will try to reproduce and fix it.
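One way to make the nondeterminism measurable is to tally how often each language is returned for the same input across many runs. This is only a sketch (the helper name `count_outcomes` is mine, not part of lingua); the commented-out lines show how it would plug into the detector calls from the issue:

```python
from collections import Counter

def count_outcomes(classify, text, runs=100):
    """Run a classifier repeatedly on the same text and tally the
    distinct results; a deterministic classifier yields one entry."""
    return Counter(str(classify(text)) for _ in range(runs))

# With lingua installed, the reproduction from the issue would be:
#   detector = LanguageDetectorBuilder.from_all_languages().build()
#   print(count_outcomes(detector.detect_language_of, '考え過ぎ(´・ω・`)'))
# Demo with a deterministic stand-in classifier:
print(count_outcomes(lambda t: 'JAPANESE', '考え過ぎ(´・ω・`)'))
# → Counter({'JAPANESE': 100})
```

A deterministic detector should produce a Counter with exactly one key; the output shown in the issue suggests the buggy build would yield two.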

elnikkis commented 1 month ago

I noticed two additional points:

(1) Besides the text shown in the initial issue, I found that misclassification also occurs with the following inputs. What they have in common is that they all contain emoticons, and the symbols in those emoticons include characters that are not Japanese.

ご尊顔(ㅅ˘ㅂ˘)✨✨
考え過ぎ(´・ω・`)
大往生でしょ(´・ω・`)
大正義ダンテ98(´・ω・`)

(2) I tried four different ways of building the detector; only the fourth produces consistent results:

detector = LanguageDetectorBuilder.from_all_languages().build()  # buggy
detector = LanguageDetectorBuilder.from_all_spoken_languages().build()  # buggy
detector = LanguageDetectorBuilder.from_languages(*Language.all()).build()  # buggy
detector = LanguageDetectorBuilder.from_languages(Language.JAPANESE, Language.CHINESE).build()  # consistent result

This suggests that there might be an issue with the detection data for one or more languages.
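One way to test the emoticon hypothesis is to strip characters outside the main Japanese Unicode blocks before passing text to the detector. This is a hypothetical preprocessing workaround using only the standard library, not anything provided by lingua; the choice of Unicode ranges is my assumption:

```python
import re

# Hypothetical workaround, not part of lingua's API: drop every
# character outside the main Japanese Unicode blocks (Hiragana
# U+3040-309F, Katakana U+30A0-30FF, CJK Unified Ideographs
# U+4E00-9FFF) so emoticon symbols cannot sway detection.
NON_JAPANESE = re.compile(r'[^\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]')

def strip_non_japanese(text: str) -> str:
    return NON_JAPANESE.sub('', text)

print(strip_non_japanese('考え過ぎ(´・ω・`)'))  # → 考え過ぎ・・
print(strip_non_japanese('ご尊顔(ㅅ˘ㅂ˘)✨✨'))  # → ご尊顔
```

Note that the katakana middle dot (・, U+30FB) from the emoticons survives the filter because it sits inside the Katakana block; that seems harmless for detection purposes.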

I hope this information is helpful to you.