Bad detection in common word

pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text

Apache License 2.0

1.1k stars 44 forks

Hello, I need to detect the language of user-generated content for a chat. I have tested this library, but it gives strange results on short text, for example the word "Hello":

from lingua import Language, LanguageDetectorBuilder

languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

text = """
Hello
"""
confidence_values = detector.compute_language_confidence_values(text.strip())
for language, value in confidence_values:
    print(f"{language.name}: {value:.2f}")

This returns Spanish as the top result, but the correct language is English:

SPANISH: 1.00
ENGLISH: 0.95
FRENCH: 0.87
GERMAN: 0.82

Do you have any tips for getting better results when detecting the language of user-generated content?
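One possible way to cope with ambiguous short inputs like this one, given only as a rough sketch: treat the result as "unknown" when the top two confidence values are too close together, instead of trusting the first entry. The helper name `pick_language`, the parameter `min_margin`, and the threshold 0.1 are hypothetical choices made for illustration and are not part of lingua. The scores are the ones printed above.

```python
from typing import Dict, Optional


def pick_language(confidences: Dict[str, float], min_margin: float = 0.1) -> Optional[str]:
    """Return the top-scoring language, or None when the runner-up is too close.

    Hypothetical helper, not a lingua API. `min_margin` is an arbitrary
    threshold that would need tuning on real chat messages.
    """
    # Rank languages from highest to lowest confidence.
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    best, best_score = ranked[0]
    second_score = ranked[1][1]
    # Too small a gap means the input does not discriminate between languages.
    if best_score - second_score < min_margin:
        return None
    return best


# Confidence values reported above for the text "Hello".
scores = {"SPANISH": 1.00, "ENGLISH": 0.95, "FRENCH": 0.87, "GERMAN": 0.82}
result = pick_language(scores)
print(result)  # None: SPANISH and ENGLISH are only 0.05 apart
```

With such a check, a message like "Hello" would be left unclassified, and the chat could fall back to something else, for example the language of the user's previous, longer messages. The lingua README also describes a minimum relative distance setting on the detector builder that serves a similar purpose.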

Bad detection in common word #94