English is detected as Sotho too often

pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text

Apache License 2.0

1.08k stars, 44 forks

English is detected as Sotho too often #158

Closed iwtu closed 11 months ago

iwtu commented 12 months ago

English is detected as Sotho so often that simply applying the rule "replace Sotho with English" would work.

pemistahl commented 11 months ago

Do you build the language detector from all available languages? Please give me concrete examples for which my library produces wrong results. Otherwise, your issue is no help for me to improve the algorithm. Thank you.
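The question about building from all available languages matters because the candidate set affects the result: with fewer candidates, there are fewer languages that a short input can be confused with. A minimal sketch of building a restricted detector follows. The four languages are an assumption made for illustration, not something stated in this thread, and `detect_with_restricted_languages` is a hypothetical helper name. The import is placed inside the function so the sketch can be read and loaded without lingua installed.

```python
def detect_with_restricted_languages(texts):
    """Detect the language of each text using a small candidate set."""
    # Lazy import: requires the lingua-language-detector package at call time.
    from lingua import Language, LanguageDetectorBuilder

    # Assumption: the application only expects these four languages.
    detector = LanguageDetectorBuilder.from_languages(
        Language.ENGLISH,
        Language.GERMAN,
        Language.FRENCH,
        Language.SPANISH,
    ).build()

    # detect_language_of returns a Language member, or None if undecided.
    return {text: detector.detect_language_of(text) for text in texts}
```

Whether this changes the outcome for a given input depends on the models; it only guarantees that a language outside the candidate set, such as Sotho here, cannot be returned.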

iwtu commented 11 months ago

code: langue_detector = LanguageDetectorBuilder.from_all_spoken_languages().build()

Samples that were detected as Sotho:

Can I use my credits?
Hello
i cannot make chcek in
I pay tikets
Hello whats going on
hello how can i get more bagage
thats it thank you
Tell the promocode
I want more kelo
Rephrase
can y cancel the tikets?

pemistahl commented 11 months ago

I cannot take your misspelled examples seriously. If you spell them correctly, the identified language is English. As for the others, the sum of the ngram probabilities for Sotho is simply larger than the sum of the ngram probabilities for English. This is not a bug, this is how the Naive Bayes language model works. You will never reach 100% accuracy, no matter which language detector you choose. That's why I am closing this issue now.
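The Naive Bayes argument above can be illustrated with a toy character-bigram model. This is a sketch under stated assumptions, not lingua's actual implementation: the two "languages" are made-up labels with tiny invented corpora, and scores are summed log-probabilities with add-one smoothing. It shows how a short or misspelled input can score higher under the wrong model simply because its n-grams happen to be more frequent there.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpus, n=2):
    """Count n-gram occurrences over a list of training sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(char_ngrams(sentence, n))
    return counts


def log_score(text, counts, vocab_size, n=2):
    """Sum of add-one-smoothed log-probabilities of the text's n-grams."""
    total = sum(counts.values())
    return sum(
        math.log((counts[gram] + 1) / (total + vocab_size))
        for gram in char_ngrams(text, n)
    )


def classify(text, models, n=2):
    """Return the best-scoring label and the per-label scores."""
    vocab = set().union(*(model.keys() for model in models.values()))
    scores = {
        label: log_score(text, model, len(vocab), n)
        for label, model in models.items()
    }
    return max(scores, key=scores.get), scores


# Invented toy corpora; the labels do not correspond to real languages.
MODELS = {
    "lang_a": train(["hello there", "thank you"]),
    "lang_b": train(["kelo kelo", "lebo lebo"]),
}

print(classify("hello", MODELS)[0])  # bigrams are more frequent in lang_a
print(classify("kelo", MODELS)[0])   # bigrams are more frequent in lang_b
```

With so few n-grams in the input, one unusual spelling is enough to tip the sum toward the other model, which is why very short inputs are the hardest case for this kind of classifier.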

iwtu commented 11 months ago

Well, these are real users' inputs. It's not all about the Naive Bayes language model; there is also a reality element :) Maybe add some typo checking/correction?

pemistahl commented 11 months ago

This is a language identification library, not a spell-checking library. Go and use an additional spell checker if you need one.
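The suggestion to put a spell checker in front of the detector could look roughly like the following. This is a minimal sketch using only the standard library's `difflib`. The `VOCABULARY` list and the `correct_typos` helper are hypothetical, and a real deployment would more likely use a dedicated spell-checking library with a full dictionary. It also assumes the input is expected to be English, which is itself a form of language guessing.

```python
import difflib

# Hypothetical domain vocabulary, invented for this illustration.
VOCABULARY = [
    "hello", "i", "can", "cannot", "cancel", "the", "tickets", "check",
    "in", "baggage", "pay", "make", "more", "get", "how", "want",
]


def correct_typos(text, vocabulary=VOCABULARY, cutoff=0.75):
    """Replace each unknown word with its closest vocabulary entry, if any."""
    corrected = []
    for word in text.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # get_close_matches returns candidates sorted by similarity.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        # Keep the original word when nothing is similar enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


print(correct_typos("i cannot make chcek in"))
print(correct_typos("i pay tikets"))
```

The corrected string could then be passed to the detector, for example via `detect_language_of`, instead of the raw user input.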