Thank you for the amazing library. I'm using lingua-py 2.0.2 to classify the language of crawled webpages. The text is simply extracted by BeautifulSoup using `soup.get_text(' ')`.
I noticed an abnormally high number of Yoruba detections (and maybe some other minor languages I didn't notice). After checking some results, it looks like it's because the webpages contain text in many languages, usually from language-selection dropdowns.
Using `LanguageDetectorBuilder.from_all_languages()`, it recognizes the text as Yoruba with 1.0 confidence.
As a human, I think there are still enough features (the text at the beginning and end) to recognize the webpage as "mainly English".
It's reasonable that lingua-py does not recognize the mixed-language text as English, but recognizing it as Yoruba with 100% confidence doesn't look correct at all. I'd like lingua-py to at least output a low confidence so I can filter out the problematic text.
This is an example input: lingua-test.txt