pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Multiple Function result discrepancy #228

Open EvGe22 opened 5 months ago

EvGe22 commented 5 months ago

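Given a text in Ukrainian, compute_language_confidence_values() and detect_multiple_languages_of() return completely different results: the first ranks Kazakh highest with a confidence of 1, while the second detects Ukrainian.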

from lingua import LanguageDetectorBuilder

# Build a detector that considers every language the library supports.
detector = LanguageDetectorBuilder.from_all_languages().build()
string = "Що найбільше подобається читачам у жанрі \"Фентезі\"?"

print(detector.compute_language_confidence_values(string))
# Output: [ConfidenceValue(language=Language.KAZAKH, value=1), ConfidenceValue(language=Language.AFRIKAANS, value=0), ConfidenceValue(language=Language.ALBANIAN, value=0), ...]

print(detector.detect_multiple_languages_of(string))
# Output: [DetectionResult(start_index=0, end_index=51, word_count=7, language=Language.UKRAINIAN)]

pemistahl commented 3 months ago

The two methods use different algorithms, so this can happen. I will try to improve them with each new release.
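
Since the two methods can legitimately disagree, one way to spot affected inputs is to compare the top-ranked language from the confidence values with the language that covers the most words in the multi-language result. The following is only a minimal sketch: the helper names (top_confidence_language, dominant_segment_language, methods_agree) are hypothetical and not part of lingua, and the code assumes nothing beyond the attributes visible in the output above (language, value, word_count).

```python
from collections import Counter
from typing import Any, Optional, Sequence


def top_confidence_language(confidence_values: Sequence[Any]) -> Optional[Any]:
    """Return the language with the highest confidence value, or None if empty.

    Assumes each item exposes `.language` and `.value`, as the
    ConfidenceValue objects in the reported output do.
    """
    if not confidence_values:
        return None
    return max(confidence_values, key=lambda cv: cv.value).language


def dominant_segment_language(segments: Sequence[Any]) -> Optional[Any]:
    """Return the language that covers the most words across all segments.

    Assumes each item exposes `.language` and `.word_count`, as the
    DetectionResult objects in the reported output do.
    """
    words_per_language = Counter()
    for segment in segments:
        words_per_language[segment.language] += segment.word_count
    if not words_per_language:
        return None
    return words_per_language.most_common(1)[0][0]


def methods_agree(confidence_values: Sequence[Any], segments: Sequence[Any]) -> bool:
    """Hypothetical check: do both results point at the same main language?"""
    return top_confidence_language(confidence_values) == dominant_segment_language(segments)
```

With the detector from the report, this could be called as methods_agree(detector.compute_language_confidence_values(string), detector.detect_multiple_languages_of(string)). For the string above it would be expected to return False, given the outputs shown in the report.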