pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Language filtering causes wrong results #71

Closed (3a77 closed 1 year ago)

3a77 commented 2 years ago

Hi, I think the language filtering that runs before the n-grams are checked is too aggressive. I've observed that a single non-German character is enough for Lingua to rule out German as a possible language. Here are a few examples:

Vandalismus in Rotenburg: Bürger unterstützen Cafébesitzer
Barça-Fans feiern fünften Saisonsieg
Führung der César-Akademie zieht sich zurück
Ein gut gekühlter Roséwein
Flüchtlingsreferendum in Ungarn: Eigentor für Orbán
Charité-Beschäftigte streikten schon mehrfach
DFB: Fünf Clásico-Erkenntnisse für Bundestrainer Joachim Löw
Der Eröffnungstag des Sónar-Festivals für elektronische Musik gehörte den Instrumentalkünstlern
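To make the pattern in these examples concrete, here is a small standard-library sketch that lists the letters in each headline that are outside the German alphabet (ASCII plus ä, ö, ü, ß). It does not call Lingua and is not Lingua's rule engine. The helper name `non_german_letters` and the constant `GERMAN_EXTRA_LETTERS` are made up for illustration.

```python
import unicodedata

# Assumption for this sketch: "German" letters are ASCII plus umlauts and eszett.
GERMAN_EXTRA_LETTERS = set("äöüÄÖÜßẞ")


def non_german_letters(text: str) -> list[str]:
    """Return distinct letters outside the German alphabet, in order of first appearance."""
    # Normalize to NFC so that accented letters are single code points.
    normalized = unicodedata.normalize("NFC", text)
    found: list[str] = []
    for ch in normalized:
        if (
            ch.isalpha()
            and not ch.isascii()
            and ch not in GERMAN_EXTRA_LETTERS
            and ch not in found
        ):
            found.append(ch)
    return found


EXAMPLES = [
    "Vandalismus in Rotenburg: Bürger unterstützen Cafébesitzer",
    "Barça-Fans feiern fünften Saisonsieg",
    "Führung der César-Akademie zieht sich zurück",
    "Ein gut gekühlter Roséwein",
    "Flüchtlingsreferendum in Ungarn: Eigentor für Orbán",
    "Charité-Beschäftigte streikten schon mehrfach",
    "DFB: Fünf Clásico-Erkenntnisse für Bundestrainer Joachim Löw",
    "Der Eröffnungstag des Sónar-Festivals für elektronische Musik "
    "gehörte den Instrumentalkünstlern",
]

if __name__ == "__main__":
    for headline in EXAMPLES:
        print(non_german_letters(headline), headline)
```

Each headline contains exactly one such letter (é, ç, á, or ó), always inside a loanword or proper noun, while the rest of the sentence is ordinary German.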

pemistahl commented 1 year ago

Hi @3a77, thank you for opening this issue. Indeed, the rule engine does not work optimally yet. I will use your examples to improve the algorithm.