pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Add absolute confidence metric #235

Open pemistahl opened 1 month ago

pemistahl commented 1 month ago

Currently, the library only provides a relative confidence metric that tells you how likely a language is in comparison to the other languages under consideration. It would be desirable to have an additional absolute confidence metric that works with a single language and is independent of any other language. With such an absolute confidence metric, a LanguageDetector instance could be built from a single language. This instance would then be able to provide binary decisions, i.e. tell whether some text is written in a specific language or not.

An absolute confidence metric could be based on the n-grams that are unique to a language or on the most common n-grams of that language.
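To make the idea more concrete, here is a minimal sketch of the "most common n-grams" variant in plain Python. It is not how lingua works internally and none of these names exist in the library: `char_ngrams`, `build_profile`, `absolute_confidence` and `is_language` are hypothetical helpers, and the trigram size, the profile size of 300 and the threshold of 0.5 are arbitrary assumptions that would need tuning on real data.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return lowercase letter n-grams, taken within each whitespace-separated word."""
    grams = []
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        grams.extend(letters[i:i + n] for i in range(len(letters) - n + 1))
    return grams


def build_profile(corpus, n=3, top_k=300):
    """Collect the top_k most frequent n-grams of a single-language corpus (hypothetical helper)."""
    counts = Counter(char_ngrams(corpus, n))
    return frozenset(gram for gram, _ in counts.most_common(top_k))


def absolute_confidence(text, profile, n=3):
    """Fraction of the text's n-grams found in the profile; no other language is involved."""
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in profile)
    return hits / len(grams)


def is_language(text, profile, threshold=0.5, n=3):
    """Binary decision; the threshold is an assumption, not a recommended value."""
    return absolute_confidence(text, profile, n) >= threshold


# Toy usage: a real profile would be built from a large corpus.
english_profile = build_profile("the cat sat on the mat and the dog ran to the cat")
score = absolute_confidence("the cat ran", english_profile)
decision = is_language("the cat ran", english_profile)
```

Because the score depends only on one language's profile, it stays the same no matter which other languages a detector is configured with. A variant based on unique n-grams would use the same scoring function but would need the profiles of the other languages at build time to filter out shared n-grams.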