pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Improve single language detection when words in other languages are quoted #112

Open schrmh opened 1 year ago

schrmh commented 1 year ago

When I input German sentences that contain quoted Japanese words, lingua may report that the text is 100% Japanese. For example, Wir stoßen an: "かんぱい". Er lächelte. (in English, if you are interested: »We toasted: "kanpai". He smiled«) yields a ConfidenceValue of 1.0 for Japanese, whereas Wir stoßen an. Er lächelte. has a ConfidenceValue of 0.6014287047855706 for German and 0.0 for Japanese (I included all languages for detection).

The expected result in both cases is German, perhaps with a small Japanese confidence in the first case because a Japanese word is quoted, but it should not be 100% Japanese.
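
Until the library handles this case itself, one possible workaround is to remove quoted spans before running detection, so that only the surrounding sentence is scored. The sketch below is a preprocessing step using only the standard library; it is not how lingua works internally. The helper name `strip_quoted_spans` and the set of quotation marks it recognizes are assumptions made for illustration.

```python
import re

# Quoted spans to drop before detection. The set of quote styles is an
# assumption: straight quotes, curly quotes, German low-high quotes, guillemets.
_QUOTED = re.compile(r'"[^"]*"|“[^”]*”|„[^“”]*[“”]|»[^«]*«|«[^»]*»')


def strip_quoted_spans(text: str) -> str:
    """Hypothetical helper: remove quoted spans and collapse leftover whitespace."""
    without_quotes = _QUOTED.sub(" ", text)
    return re.sub(r"\s+", " ", without_quotes).strip()


sample = 'Wir stoßen an: "かんぱい". Er lächelte.'
print(strip_quoted_spans(sample))
```

The cleaned string could then be passed to the detector in place of the original input. The trade-off is that any language information inside the quotes is discarded, so this only suits cases where the language of the surrounding text is what matters.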

pemistahl commented 1 year ago

Thanks for reaching out to me. I will try to improve language detection for inputs like yours, even though it's not a trivial problem to solve.

datatalking commented 1 year ago

@pemistahl If you could point me to the relevant part of the code, I could look at a few options and test adding this feature.