pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0
1.02k stars 43 forks

High-confidence false detections on text from webpages that contain many languages #215

Open cuihaoleo opened 6 months ago

cuihaoleo commented 6 months ago

Thank you for the amazing library. I'm using lingua-py 2.0.2 to classify the language of crawled webpages. The text is extracted with BeautifulSoup using soup.get_text(' ').

I noticed an abnormally high number of Yoruba detections (and maybe detections of some other less common languages that I didn't notice). After checking some results, it looks like this happens because the webpage contains text in many languages, usually from language selection dropdowns.
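
One possible mitigation on the extraction side is to drop dropdown text before detection. The sketch below is not part of lingua-py and does not use BeautifulSoup; it swaps in the standard library's html.parser so that it runs without extra dependencies. The names SKIPPED_TAGS, VisibleTextExtractor and extract_text are hypothetical, and the choice of tags to skip is an assumption (many language switchers are built from ul/a elements rather than select, and would need their own rule).

```python
from html.parser import HTMLParser

# Tags whose text is dropped entirely. This set is an assumption; extend it
# to match the markup of the pages being crawled.
SKIPPED_TAGS = {"select", "script", "style"}


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the page text with dropdown, script and style content removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With BeautifulSoup, a comparable effect should be achievable by calling decompose() on the unwanted tags before soup.get_text(' ').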

This is an example input: lingua-test.txt

Using LanguageDetectorBuilder.from_all_languages(), it recognizes the text as Yoruba with 1.0 confidence.

As a human reader, I think there are still enough features (the text at the beginning and end) to recognize the webpage as "mainly English".

It's reasonable that lingua-py does not recognize the mixed-language text as English, but recognizing it as Yoruba with 100% confidence doesn't look correct at all. I'd like lingua-py to at least output a low confidence so that I can filter out the problematic text.

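>>> from lingua import LanguageDetectorBuilder
>>> lang_detector = LanguageDetectorBuilder.from_all_languages().build()
>>> text = open('lingua-test.txt', 'r').read()
>>> lang_detector.compute_language_confidence_values(text)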
[ConfidenceValue(language=Language.YORUBA, value=1), ConfidenceValue(language=Language.AFRIKAANS, value=0), ConfidenceValue(language=Language.ALBANIAN, value=0), ConfidenceValue(language=Language.ARABIC, value=0), ConfidenceValue(language=Language.ARMENIAN, value=0), ConfidenceValue(language=Language.AZERBAIJANI, value=0), ConfidenceValue(language=Language.BASQUE, value=0), ConfidenceValue(language=Language.BELARUSIAN, value=0), ConfidenceValue(language=Language.BENGALI, value=0), ConfidenceValue(language=Language.BOKMAL, value=0), ConfidenceValue(language=Language.BOSNIAN, value=0), ConfidenceValue(language=Language.BULGARIAN, value=0), ConfidenceValue(language=Language.CATALAN, value=0), ConfidenceValue(language=Language.CHINESE, value=0), ConfidenceValue(language=Language.CROATIAN, value=0), ConfidenceValue(language=Language.CZECH, value=0), ConfidenceValue(language=Language.DANISH, value=0), ConfidenceValue(language=Language.DUTCH, value=0), ConfidenceValue(language=Language.ENGLISH, value=0), ConfidenceValue(language=Language.ESPERANTO, value=0), ConfidenceValue(language=Language.ESTONIAN, value=0), ConfidenceValue(language=Language.FINNISH, value=0), ConfidenceValue(language=Language.FRENCH, value=0), ConfidenceValue(language=Language.GANDA, value=0), ConfidenceValue(language=Language.GEORGIAN, value=0), ConfidenceValue(language=Language.GERMAN, value=0), ConfidenceValue(language=Language.GREEK, value=0), ConfidenceValue(language=Language.GUJARATI, value=0), ConfidenceValue(language=Language.HEBREW, value=0), ConfidenceValue(language=Language.HINDI, value=0), ConfidenceValue(language=Language.HUNGARIAN, value=0), ConfidenceValue(language=Language.ICELANDIC, value=0), ConfidenceValue(language=Language.INDONESIAN, value=0), ConfidenceValue(language=Language.IRISH, value=0), ConfidenceValue(language=Language.ITALIAN, value=0), ConfidenceValue(language=Language.JAPANESE, value=0), ConfidenceValue(language=Language.KAZAKH, value=0), 
ConfidenceValue(language=Language.KOREAN, value=0), ConfidenceValue(language=Language.LATIN, value=0), ConfidenceValue(language=Language.LATVIAN, value=0), ConfidenceValue(language=Language.LITHUANIAN, value=0), ConfidenceValue(language=Language.MACEDONIAN, value=0), ConfidenceValue(language=Language.MALAY, value=0), ConfidenceValue(language=Language.MAORI, value=0), ConfidenceValue(language=Language.MARATHI, value=0), ConfidenceValue(language=Language.MONGOLIAN, value=0), ConfidenceValue(language=Language.NYNORSK, value=0), ConfidenceValue(language=Language.PERSIAN, value=0), ConfidenceValue(language=Language.POLISH, value=0), ConfidenceValue(language=Language.PORTUGUESE, value=0), ConfidenceValue(language=Language.PUNJABI, value=0), ConfidenceValue(language=Language.ROMANIAN, value=0), ConfidenceValue(language=Language.RUSSIAN, value=0), ConfidenceValue(language=Language.SERBIAN, value=0), ConfidenceValue(language=Language.SHONA, value=0), ConfidenceValue(language=Language.SLOVAK, value=0), ConfidenceValue(language=Language.SLOVENE, value=0), ConfidenceValue(language=Language.SOMALI, value=0), ConfidenceValue(language=Language.SOTHO, value=0), ConfidenceValue(language=Language.SPANISH, value=0), ConfidenceValue(language=Language.SWAHILI, value=0), ConfidenceValue(language=Language.SWEDISH, value=0), ConfidenceValue(language=Language.TAGALOG, value=0), ConfidenceValue(language=Language.TAMIL, value=0), ConfidenceValue(language=Language.TELUGU, value=0), ConfidenceValue(language=Language.THAI, value=0), ConfidenceValue(language=Language.TSONGA, value=0), ConfidenceValue(language=Language.TSWANA, value=0), ConfidenceValue(language=Language.TURKISH, value=0), ConfidenceValue(language=Language.UKRAINIAN, value=0), ConfidenceValue(language=Language.URDU, value=0), ConfidenceValue(language=Language.VIETNAMESE, value=0), ConfidenceValue(language=Language.WELSH, value=0), ConfidenceValue(language=Language.XHOSA, value=0), ConfidenceValue(language=Language.ZULU, value=0)]
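
Until the library reports a lower confidence for such input, one way to approximate "mainly English" is to detect the language of small chunks and take a vote. The following is only a sketch of that idea, not something lingua-py does internally. The function name dominant_language, the chunk_words parameter and its default of 20 are hypothetical. The detect argument is any callable that maps a string to a label or None.

```python
from collections import Counter


def dominant_language(text, detect, chunk_words=20):
    """Vote over fixed-size word chunks.

    Returns (label, share), where share is the fraction of labelled chunks
    that received the winning label. Returns (None, 0.0) when no chunk
    yields a label.
    """
    words = text.split()
    chunks = [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    votes = Counter()
    for chunk in chunks:
        label = detect(chunk)
        if label is not None:
            votes[label] += 1
    if not votes:
        return None, 0.0
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())
```

With lingua-py, lang_detector.detect_language_of could be passed as detect, since it returns a Language or None. A low share then flags a page as mixed, which gives a value to filter on even when the whole-text confidence is 1.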

yudelevi commented 6 days ago

Yoruba seems to be a catch-all language. I ended up treating every Yoruba result with 100% confidence as low confidence.
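
A minimal sketch of that workaround, assuming the top result is available as a language name and a value. The names SUSPECT_LANGUAGES and is_trustworthy and the min_value threshold of 0.5 are hypothetical, and the suspect set reflects only what was observed in this thread.

```python
# Assumption: only Yoruba has been reported as a catch-all in this thread.
SUSPECT_LANGUAGES = {"YORUBA"}


def is_trustworthy(top_language_name, top_value, min_value=0.5,
                   suspects=SUSPECT_LANGUAGES):
    """Return False for results that should be treated as low confidence."""
    if top_value < min_value:
        return False
    # A suspect language at full confidence is treated as a false detection.
    if top_language_name in suspects and top_value >= 1.0:
        return False
    return True
```

With lingua-py, the inputs could come from the first entry of compute_language_confidence_values(text), for example top.language.name and top.value.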