Hi @OmriPi, thank you for your question. The Python implementation of Lingua does not support this yet, but the Go implementation does. There is already issue #86 for this feature, which I will start implementing soon. Once it is implemented, your best bet is to build the detector from all languages and then compute the confidence value for English only. If the confidence for your text falls below a certain threshold (say 0.5), you can classify it as non-English. But it's wise to experiment with different threshold values.
@OmriPi I've just released Lingua 1.2.0, which allows you to do what you want. Perhaps you want to try it out again. It's best to build the detector from all languages, because the detector still compares the confidence values of all languages with each other, which yields a more realistic confidence value for English.
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_all_languages().build()
confidence = detector.compute_language_confidence("some text", Language.ENGLISH)
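Putting the two comments together, here is a minimal sketch of the English / non-English filter described above; the 0.5 cutoff is only the example value from the earlier comment, not a recommended default, and should be tuned on your own data:

from lingua import Language, LanguageDetectorBuilder

# Build the detector once and reuse it; construction is the expensive part.
detector = LanguageDetectorBuilder.from_all_languages().build()

def is_english(text: str, threshold: float = 0.5) -> bool:
    # Compute the confidence for English only and compare it against
    # the cutoff; 0.5 is an assumed starting point to experiment with.
    return detector.compute_language_confidence(text, Language.ENGLISH) >= threshold

print(is_english("This is clearly an English sentence."))  # expected: True
print(is_english("Dies ist ein deutscher Satz."))          # expected: False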
If you have any feedback, please let me know. Thanks.
Hi, I'm currently working on a project which requires me to filter out all non-English text. The data consists mostly of short texts, most of them in English. I thought of building the language detector with only Language.ENGLISH, but got an error that at least two languages are required. I do not care about knowing which language each non-English text is actually in, only English / non-English. What would be the correct way to go about this with Lingua? I worry that setting it to recognize all languages might just add unnecessary noise to the prediction, which should be biased towards English in my case. Thanks! A minimal sketch of the one-language attempt that raises the error follows below.
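(The exact error message may vary between versions.)

from lingua import Language, LanguageDetectorBuilder

# Passing a single language raises a ValueError, because the detector
# needs at least two languages whose confidence values it can compare.
detector = LanguageDetectorBuilder.from_languages(Language.ENGLISH).build()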