Failed to predict correct language for popular English single words

pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text

Apache License 2.0

Failed to predict correct language for popular English single words #97

Closed: Alex-Kopylov closed this issue 1 year ago

Alex-Kopylov commented 1 year ago

Hello

"ITALIAN": 0.9900000000000001,
"SPANISH": 0.8457074930316446,
"ENGLISH": 0.6405700388041755,
"FRENCH": 0.260556921899765,
"GERMAN": 0.01,
"CHINESE": 0,
"RUSSIAN": 0

Bye

"FRENCH": 0.9899999999999999,
"ENGLISH": 0.9062076381164255,
"GERMAN": 0.6259792361883574,
"SPANISH": 0.46755135335558035,
"ITALIAN": 0.01,
"CHINESE": 0,
"RUSSIAN": 0

Loss (not Löss)

"GERMAN": 0.99,
"ENGLISH": 0.9177028091362562,
"ITALIAN": 0.9082690119891484,
"FRENCH": 0.7091301303929289,
"SPANISH": 0.01,
"CHINESE": 0,
"RUSSIAN": 0

Alex-Kopylov commented 1 year ago

Btw, I've only briefly checked whether these words exist in Italian, French, and German. Still, based on how widespread these words are, I assume English should come first in the prediction. Please correct me if I'm wrong.

pemistahl commented 1 year ago

Purely statistical approaches to language detection are never 100% correct. The letter sequences in your examples are common in English, but even more common in Italian, French, or German. That's why those languages get higher probabilities than English for these words.

Feed longer strings into the detector. Then you will get more reliable results.
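
To make the point about letter sequences concrete, here is a minimal sketch of a character n-gram scorer. It is a toy and not lingua's actual implementation: the three sample sentences, the trigram size, the add-one smoothing, and the names `SAMPLES`, `ngrams`, `log_likelihood`, and `confidence_values` are all assumptions chosen for illustration. With training data this small the exact numbers mean little. The part that carries over is the structure: every n-gram in the input adds one more term to each language's score, so a single word gives the model only a handful of terms to work with.

```python
import math
from collections import Counter

# Tiny hand-written samples: an assumption for illustration only.
# A real detector is trained on far larger corpora.
SAMPLES = {
    "ENGLISH": "hello my friend how are you today i hope the weather is good and you are well",
    "ITALIAN": "ciao amico mio come stai oggi spero che il tempo sia bello e che tu stia bene",
    "FRENCH": "bonjour mon ami comment vas tu aujourd'hui je pense que le temps est beau et que tout va bien",
}
N = 3  # trigram size, chosen arbitrarily for this sketch


def ngrams(text, n=N):
    """Return the overlapping character n-grams of text, padded with one space per side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# One n-gram frequency table per language.
MODELS = {lang: Counter(ngrams(sample)) for lang, sample in SAMPLES.items()}
# Number of distinct n-grams seen in any sample, used for smoothing.
VOCAB = len(set().union(*MODELS.values()))


def log_likelihood(text, lang):
    """Sum of smoothed log-probabilities of the text's n-grams under one language."""
    counts = MODELS[lang]
    total = sum(counts.values())
    # Add-one smoothing keeps unseen n-grams from producing log(0).
    return sum(math.log((counts[g] + 1) / (total + VOCAB)) for g in ngrams(text))


def confidence_values(text):
    """Return (language, normalized score) pairs, highest score first."""
    scores = {lang: log_likelihood(text, lang) for lang in MODELS}
    top = max(scores.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = {lang: math.exp(score - top) for lang, score in scores.items()}
    norm = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda pair: pair[1], reverse=True)
    return [(lang, weight / norm) for lang, weight in ranked]


if __name__ == "__main__":
    for text in ("hello", "hello my friend how are you"):
        print(repr(text), len(ngrams(text)), "n-grams")
        for lang, value in confidence_values(text):
            print(f"  {lang}: {value:.4f}")
```

The word "hello" yields five trigrams here, while the longer phrase yields many more, which is the reason for the advice to feed longer strings. In lingua itself the comparable call is, as far as I know, `detector.compute_language_confidence_values(text)` on a detector built with `LanguageDetectorBuilder`.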