komodojp / tinyld

Simple and Performant Language detection library for NodeJS
https://komodojp.github.io/tinyld/
MIT License
415 stars, 12 forks

"hello" is 0.62% English #28

Closed (winrid closed this issue 4 months ago)

winrid commented 4 months ago

As the title says, even with TinyLD Heavy, the word "hello" is scored as only 0.62% English...

kefniark commented 4 months ago

Yes, and this is by design. Because it uses a statistical approach, it needs a certain number of characters (~40) to work with. It cannot work with one or two words and never will. https://github.com/komodojp/tinyld/blob/develop/docs/faq.md#can-tinyld-identify-short-strings

I guess for the moment your best hope is that someone makes a good AI for language detection.
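For readers who hit the same limit, one way to act on the ~40 character guidance is to decline to answer on short input rather than trust a low score. The sketch below is only an illustration: `detectIfLongEnough`, `Detector`, and `MIN_CHARS` are hypothetical names, not part of TinyLD, and the detector is passed in as a plain function so nothing here assumes a particular library API.

```typescript
// Hypothetical helper, not part of TinyLD: skip statistical detection
// when the input is too short for the result to be meaningful.
type Detector = (text: string) => string;

// Rough threshold taken from the "~40 characters" figure in this thread.
const MIN_CHARS = 40;

function detectIfLongEnough(
  text: string,
  detector: Detector,
  minChars: number = MIN_CHARS,
): string | null {
  const trimmed = text.trim();
  // Array.from splits by code point, so astral characters count once.
  if (Array.from(trimmed).length < minChars) {
    return null; // too short: report "unknown" instead of guessing
  }
  return detector(trimmed);
}

// Stand-in detector used only for this demonstration.
const alwaysEnglish: Detector = () => "en";
```

Calling `detectIfLongEnough("hello", alwaysEnglish)` returns `null`, which the caller can treat as "undetermined" and route to some other strategy.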

winrid commented 4 months ago

I see. I figured you had compressed dictionaries of common words or some such... I will just do this myself. I need to determine language from just one word sometimes, as it's used to implement a language whitelist.
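A minimal sketch of the dictionary idea mentioned above, assuming a small hand-built word list per language. The lists, `languagesForWord`, and `isAllowed` are all hypothetical and purely illustrative; a real whitelist would need far larger dictionaries and a policy for words that belong to several languages.

```typescript
// Hypothetical word lists keyed by ISO 639-1 code; real lists would be
// much larger and probably loaded from files.
const COMMON_WORDS: Record<string, ReadonlySet<string>> = {
  en: new Set(["hello", "thanks", "yes", "no", "please"]),
  fr: new Set(["bonjour", "merci", "oui", "non"]),
  es: new Set(["hola", "gracias", "sí", "no"]),
};

// Return every language whose list contains the word (case-insensitive).
function languagesForWord(word: string): string[] {
  const key = word.trim().toLowerCase();
  return Object.keys(COMMON_WORDS).filter((lang) => {
    const words = COMMON_WORDS[lang];
    return words !== undefined && words.has(key);
  });
}

// A word passes if at least one candidate language is whitelisted.
// Unknown words match no language and are therefore rejected.
function isAllowed(word: string, whitelist: ReadonlySet<string>): boolean {
  return languagesForWord(word).some((lang) => whitelist.has(lang));
}
```

With these lists, `languagesForWord("no")` yields both `en` and `es`, which shows why a single word can be ambiguous: this sketch accepts such a word if any candidate language is on the whitelist, and a stricter policy could require all candidates to be allowed.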