greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/
MIT License

Slovak language support #52

Closed valeriansaliou closed 4 years ago

valeriansaliou commented 4 years ago

Hello there!

I'm using whatlang as part of the Sonic language detection system. It works great overall. Thanks a lot for your work, and for recently adding Latin, which has since been implemented in Sonic.

A user on my end is requesting that Slovak be added to Sonic. Do you think this is possible in whatlang? Is there any reason it's not there (I see that Slovene is supported, while Slovak is not)?

Ref: https://github.com/valeriansaliou/sonic/issues/178

valeriansaliou commented 4 years ago

Note that I'd be happy to open a PR for this myself, if there is no blocking reason why Slovak has not been implemented so far.

greyblake commented 4 years ago

Is there any reason it's not there (I see that Slovene is supported, while Slovak is not there).

If I remember correctly, I just tried to implement the most popular languages by number of native speakers, using a list from Wikipedia. Slovak was probably not on that list.

There are a few reasons why I decided not to add every possible language:

In short, the set of languages implemented in whatlang is a reasonable, pragmatic trade-off. In most cases I would be OK with adding a new language on demand if someone has a real need and requests it.

Btw, I just added Slovak in whatlang 0.9.0.

Thank you for using whatlang. I implemented the library just for fun, but your Sonic search engine puts it to real practical use :)

valeriansaliou commented 4 years ago

Hello @greyblake

Thank you so much for the quick answer and release, really appreciated!

Slovak support has just been added to Sonic: https://github.com/valeriansaliou/sonic/commit/19412ce05a802ef1e6054b751faaef50cab5d36b

On the reasons as to why not all languages are available, I completely understand.

The main problem is that so many different European languages share the same Latin script. There would probably be an optimization path where you'd add a pre-detection pass after Latin is detected as the script, restricting the language list even further by accented characters. E.g. "ē" appears in Latvian (and possibly other Baltic languages), but definitely does not occur in French. It's not that straightforward, though: a Latvian sentence may not contain any accented character that characterizes a Baltic language, so there needs to be a fallback to avoid such false negatives.

Cyrillic, Arabic, Mandarin, Kanji, etc. scripts, on the other hand, do not have this performance issue.
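To make the suggested pre-detection pass concrete, here is a minimal sketch in plain Rust. It is not whatlang's implementation: the `Language` enum, the `ALL_LATIN` candidate list, the letter table in `languages_using`, and the function `narrow_candidates` are all hypothetical names made up for illustration, and the letter table is deliberately tiny and only meaningful relative to these four candidates. The sketch narrows the candidate list when a characteristic accented letter is seen, and falls back to the full list when none is found, which covers the false-negative case mentioned above.

```rust
/// Hypothetical language tags for this sketch (not whatlang's `Lang` enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    French,
    German,
    Latvian,
    Slovak,
}

/// Hypothetical full candidate list once the Latin script has been detected.
const ALL_LATIN: [Language; 4] = [
    Language::French,
    Language::German,
    Language::Latvian,
    Language::Slovak,
];

/// Illustrative, incomplete table: which of the candidates above use a
/// given accented letter. Unknown letters give no hint (empty slice).
fn languages_using(c: char) -> &'static [Language] {
    match c {
        'ē' | 'ā' | 'ī' | 'ū' => &[Language::Latvian],
        'ľ' | 'ĺ' | 'ŕ' => &[Language::Slovak],
        'ß' => &[Language::German],
        'ç' | 'œ' => &[Language::French],
        _ => &[],
    }
}

/// Pre-detection pass: collect every language hinted at by an accented
/// letter. If the text contains no hint at all, fall back to the full
/// list so that unaccented sentences are not wrongly excluded.
fn narrow_candidates(text: &str) -> Vec<Language> {
    let mut hinted: Vec<Language> = Vec::new();
    for c in text.chars().flat_map(char::to_lowercase) {
        for &lang in languages_using(c) {
            if !hinted.contains(&lang) {
                hinted.push(lang);
            }
        }
    }
    if hinted.is_empty() {
        ALL_LATIN.to_vec()
    } else {
        hinted
    }
}

fn main() {
    // Contains 'ā', so the candidate list shrinks.
    println!("{:?}", narrow_candidates("Es mācos latviešu valodu"));
    // No characteristic letter, so the full list is kept.
    println!("{:?}", narrow_candidates("Labdien"));
}
```

The narrowed list would then be handed to the regular (more expensive) detector, which only has to compare against the remaining candidates.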

Thanks again! Valerian.

greyblake commented 4 years ago

@valeriansaliou Thanks for the suggestion. I had a similar idea in mind and even implemented something similar years ago in the Smartdict project.

However, this approach becomes trickier when you consider that text in one language may include words from another language. E.g. the German alphabet does not have é, but the French word Exposé is widely used in modern German.
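The loanword problem can be shown with a small sketch. Everything here is hypothetical and not taken from whatlang or Smartdict: the `GERMAN_ACCENTED` set, the function names, and in particular the tolerance threshold are assumptions made for illustration. A strict rule that excludes German as soon as one non-German letter appears fails on a sentence containing "Exposé", while a tolerant rule (one possible mitigation, not something proposed in this thread) only excludes German when such letters exceed a chosen share of the text.

```rust
/// Hypothetical set of non-ASCII letters used in native German spelling.
const GERMAN_ACCENTED: [char; 4] = ['ä', 'ö', 'ü', 'ß'];

/// Counts non-ASCII letters that are outside the German set above.
fn foreign_letter_count(text: &str) -> usize {
    text.chars()
        .flat_map(char::to_lowercase)
        .filter(|c| c.is_alphabetic() && !c.is_ascii() && !GERMAN_ACCENTED.contains(c))
        .count()
}

/// Strict rule: a single foreign letter rules German out.
fn strict_allows_german(text: &str) -> bool {
    foreign_letter_count(text) == 0
}

/// Tolerant rule (an assumption for this sketch): keep German as a
/// candidate unless foreign letters exceed `max_ratio` of all letters.
fn tolerant_allows_german(text: &str, max_ratio: f64) -> bool {
    let letters = text.chars().filter(|c| c.is_alphabetic()).count();
    if letters == 0 {
        return true;
    }
    (foreign_letter_count(text) as f64) / (letters as f64) <= max_ratio
}

fn main() {
    let text = "Bitte senden Sie mir das Exposé für die Wohnung";
    // The single 'é' is enough for the strict rule to drop German.
    println!("strict:   {}", strict_allows_german(text));
    // With a 5% tolerance, one loanword does not exclude German.
    println!("tolerant: {}", tolerant_allows_german(text, 0.05));
}
```

The threshold trades one error for another: set too high, the filter stops narrowing anything, so the value would need tuning against real text.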

valeriansaliou commented 4 years ago

Ah, snap yes, indeed. I understand then, trickier than it seems w/ modern language usage.