pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Is it possible to detect only English using lingua? #92

Closed OmriPi closed 1 year ago

OmriPi commented 1 year ago

Hi, I'm currently working on a project that requires me to filter out all non-English text. The data consists mostly of short texts, most of them in English. I thought of building the language detector with only Language.ENGLISH, but got an error saying that at least two languages are required. I don't care which language each non-English text is actually in, only whether it is English or not. What would be the correct way to go about this with lingua? I suspect that configuring it to recognize all languages might be problematic, because the extra languages could add unnecessary noise to the prediction, which should be biased towards English in my case. Thanks!

pemistahl commented 1 year ago

Hi @OmriPi, thank you for your question. The Python implementation of Lingua does not support this yet, but the Go implementation does. Issue #86 already tracks this feature, and I will start implementing it soon. Once it is available, your best option is to build the detector from all languages and then compute the confidence value for English only. If the confidence value for a text is below a certain threshold (say 0.5), you can classify it as non-English. It's wise to experiment with different threshold values, though.
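One way to try out different thresholds, sketched under assumptions: take a small hand-labeled sample of (English confidence, is-English) pairs and check how many are classified correctly at each candidate threshold. The helper name `accuracy_at` and the sample values are made up for illustration. Real confidence values would come from the detector.

```python
from typing import List, Tuple


def accuracy_at(threshold: float, sample: List[Tuple[float, bool]]) -> float:
    """Fraction of samples classified correctly when
    'confidence >= threshold' is taken to mean English."""
    correct = sum(
        (confidence >= threshold) == is_english
        for confidence, is_english in sample
    )
    return correct / len(sample)


# Made-up (confidence, is_english) pairs, for illustration only.
sample = [
    (0.92, True),
    (0.71, True),
    (0.48, True),
    (0.35, False),
    (0.10, False),
    (0.55, False),
]

# Compare a few candidate thresholds on the labeled sample.
for threshold in (0.3, 0.4, 0.5, 0.6):
    print(f"threshold={threshold:.1f} accuracy={accuracy_at(threshold, sample):.2f}")
```

The threshold that scores best on a sample like this depends entirely on the data, so it is worth repeating the comparison on texts from the actual project.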

pemistahl commented 1 year ago

@OmriPi I've just released Lingua 1.2.0, which allows you to do what you want. Perhaps you want to try it out again. It's best to build the detector from all languages, because the detector still compares the confidence values of all languages with each other, which yields a more realistic confidence value for English.

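from lingua import Language, LanguageDetectorBuilder
# Build the detector from all languages, then query the confidence for English only.
detector = LanguageDetectorBuilder.from_all_languages().build()
detector.compute_language_confidence("some text", Language.ENGLISH)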
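For the English / non-English filtering asked about in this issue, the confidence value can be compared against a threshold. The following is only a minimal sketch: `keep_english` is a hypothetical helper, not part of Lingua, and it takes the confidence function as a parameter so that the thresholding logic stays independent of the detector.

```python
from typing import Callable, Iterable, List


def keep_english(
    texts: Iterable[str],
    english_confidence: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Return the texts whose English confidence is at least `threshold`.

    `english_confidence` is any function mapping a text to a value
    between 0.0 and 1.0 (an assumption of this sketch).
    """
    return [text for text in texts if english_confidence(text) >= threshold]
```

With Lingua 1.2.0 or later, `english_confidence` could be `lambda text: detector.compute_language_confidence(text, Language.ENGLISH)`, using the detector built above. The default of 0.5 is just the starting value suggested earlier in this thread and should be tuned.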

If you have any feedback, please let me know. Thanks.