pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

English is detected as Sotho too often #158

Closed iwtu closed 11 months ago

iwtu commented 12 months ago

English text is misdetected as Sotho so often that simply applying the rule "replace Sotho with English" would work.

pemistahl commented 11 months ago

Do you build the language detector from all available languages? Please give me concrete examples for which my library produces wrong results. Otherwise, your issue does not help me improve the algorithm. Thank you.

iwtu commented 11 months ago

Code: language_detector = LanguageDetectorBuilder.from_all_spoken_languages().build()

Samples that were detected as Sotho.

pemistahl commented 11 months ago

I cannot take your misspelled examples seriously. If you spell them correctly, the identified language is English. As for the others, the sum of the n-gram probabilities for Sotho is simply larger than the sum of the n-gram probabilities for English. This is not a bug; this is how the Naive Bayes language model works. You will never reach 100% accuracy, no matter which language detector you choose. That's why I am closing this issue now.
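To illustrate the point about summed n-gram probabilities, here is a minimal sketch of a character-trigram Naive Bayes scorer. It is not the library's implementation: the function names (`char_ngrams`, `train`, `log_score`, `detect`), the Laplace smoothing, the use of log-probabilities, and the tiny English/German training strings are all assumptions made for illustration only.

```python
import math
from collections import Counter

# Toy training text (an assumption for illustration, not real model data).
TOY_CORPUS = {
    "english": "the quick brown fox jumps over the lazy dog and then the weather is nice today",
    "german": "der schnelle braune fuchs springt ueber den faulen hund und das wetter ist heute gut",
}


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpus, n=3):
    """Count n-grams per language; returns {language: Counter}."""
    return {lang: Counter(char_ngrams(sample, n)) for lang, sample in corpus.items()}


def log_score(text, counts, vocab_size, n=3):
    """Sum the smoothed log-probabilities of the text's n-grams for one language."""
    total = sum(counts.values())
    score = 0.0
    for gram in char_ngrams(text, n):
        # Laplace (add-one) smoothing so unseen n-grams do not give log(0).
        score += math.log((counts[gram] + 1) / (total + vocab_size))
    return score


def detect(text, models, n=3):
    """Return the language with the largest summed log-probability, plus all scores."""
    vocab = set().union(*models.values())
    scores = {
        lang: log_score(text, counts, len(vocab), n)
        for lang, counts in models.items()
    }
    return max(scores, key=scores.get), scores


models = train(TOY_CORPUS)
best, scores = detect("the weather", models)
print(best, scores)
```

The winner is whichever language accumulates the larger sum. A typo swaps familiar n-grams for unfamiliar ones, so on short input a few misspelled words can be enough to tip the comparison toward another language.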

iwtu commented 11 months ago

Well, these are real user inputs. It's not all about the Naive Bayes language model; there is also an element of reality :) Maybe add some check or correction for typos?

pemistahl commented 11 months ago

This is a language identification library, not a spell-checking library. Go and use an additional spell checker if you need one.
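As a rough sketch of that suggestion, typos could be corrected in a separate preprocessing step before the text reaches the detector. The example below uses only Python's standard `difflib` module. The `VOCABULARY` list and the `correct_typos` helper are hypothetical; a real setup would presumably use a full dictionary or a dedicated spell checker instead.

```python
import difflib

# Hypothetical mini-vocabulary for illustration only.
VOCABULARY = ["hello", "please", "thanks", "weather", "tomorrow", "because"]


def correct_typos(text, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace each word with its closest vocabulary entry, if one is similar enough."""
    corrected = []
    for word in text.lower().split():
        # get_close_matches returns the best matches with similarity >= cutoff.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


print(correct_typos("helo wether tomorow"))
```

The corrected string would then be passed to the detector, for example via `detect_language_of`. Note that this only helps if the input language is already known or assumed, since the vocabulary itself is language-specific.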