pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Bad detection in common word #94

Jourdelune closed this issue 1 year ago

Jourdelune commented 1 year ago

Hello, I need to detect the language of user-generated content for a chat. I have tested this library, but it gives strange results on short text, for example the word "Hello":

from lingua import Language, LanguageDetectorBuilder

languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

text = """
Hello
"""
confidence_values = detector.compute_language_confidence_values(text.strip())
for language, value in confidence_values:
    print(f"{language.name}: {value:.2f}")

This returns Spanish as the most likely language (but the correct language is English):

SPANISH: 1.00
ENGLISH: 0.95
FRENCH: 0.87
GERMAN: 0.82

Do you have any tips for getting better results when detecting the language of user-generated content?

pemistahl commented 1 year ago

Pure statistical approaches to language detection are never 100% correct. The letter sequence in the word 'hello' is very common in Spanish, so the algorithm thinks it's Spanish as the probability for Spanish is greater than the probability for English.

Feed longer strings into the detector. Then you will get more reliable results.
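
For a chat, one way to feed longer strings into the detector is to run detection on a user's recent messages joined together rather than on each message alone. The following is only a minimal sketch of that idea: the `ChatLanguageTracker` class, its `min_chars` and `max_messages` parameters, and their default values are hypothetical and not part of lingua. The detector is passed in as a plain callable, so the sketch makes no assumptions about the library beyond "text in, language or None out".

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Optional


class ChatLanguageTracker:
    """Hypothetical helper: detect language over a user's recent messages.

    `detect` is any callable that maps a string to a language (or None).
    """

    def __init__(
        self,
        detect: Callable[[str], Optional[object]],
        min_chars: int = 40,      # assumed threshold, tune for your data
        max_messages: int = 10,   # assumed window size, tune for your data
    ) -> None:
        self._detect = detect
        self._min_chars = min_chars
        # Keep only the last `max_messages` messages per user.
        self._history: Dict[str, Deque[str]] = defaultdict(
            lambda: deque(maxlen=max_messages)
        )

    def add_message(self, user_id: str, text: str) -> Optional[object]:
        """Store the message and return a language once enough text exists."""
        text = text.strip()
        if text:
            self._history[user_id].append(text)
        combined = " ".join(self._history[user_id])
        if len(combined) < self._min_chars:
            # Too little text for a reliable guess; do not call the detector.
            return None
        return self._detect(combined)
```

With the detector built in the question, this could be used as `tracker = ChatLanguageTracker(detector.detect_language_of)` followed by `tracker.add_message("user-1", "Hello")`. A single "Hello" then yields `None` instead of a misleading guess, and a result is only returned once the user has written enough text. The trade-off is that a user who switches languages mid-conversation is detected with some delay.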