greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/
MIT License
969 stars, 109 forks

Contribute new languages #1

Closed halfdan closed 7 years ago

halfdan commented 7 years ago

I'm a recent contributor to the Tatoeba project, and I'm currently working on finding a better language detection library. Given the data collected at Tatoeba, it is easy to provide a set of trigrams for about 200 languages (I already have them in a database).

Let me know if you're interested in trying to integrate some more languages using this dataset. It would be worth trying it out and comparing the results against other libraries, or evaluating on the original Tatoeba dataset (~5M sentences).

greyblake commented 7 years ago

Hi, @halfdan, thanks for offering the help!

I've come across Tatoeba before (maybe because it's a well-known project among Esperantists :)).

Whatlang is inspired by the Franc JS project: it applies basically the same algorithm, and I use the trigrams from that library. I haven't integrated all the possible languages yet.
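For readers unfamiliar with trigram-based detection, here is a minimal sketch of the general idea: rank the trigrams of the input text by frequency, then pick the language whose ranked trigram list is closest. This is an illustration only, not whatlang's or Franc's actual code. The names `trigram_profile`, `distance`, and `detect` are hypothetical, and the `MAX_PENALTY` value is an arbitrary assumption.

```rust
use std::collections::HashMap;

/// Penalty for a trigram that is missing from a language profile.
/// Assumption: the value is arbitrary and chosen only for this sketch.
const MAX_PENALTY: usize = 300;

/// Returns the trigrams of `text`, ordered from most to least frequent.
/// Ties are broken alphabetically so the result is deterministic.
fn trigram_profile(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    let mut counts: HashMap<String, usize> = HashMap::new();
    for window in chars.windows(3) {
        let trigram: String = window.iter().collect();
        *counts.entry(trigram).or_insert(0) += 1;
    }
    let mut ranked: Vec<(String, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.into_iter().map(|(trigram, _)| trigram).collect()
}

/// Sums, over the trigrams of the text, how far each one's rank is from its
/// rank in the language profile. Missing trigrams cost `MAX_PENALTY`.
fn distance(text_profile: &[String], lang_profile: &[String]) -> usize {
    text_profile
        .iter()
        .enumerate()
        .map(|(rank, trigram)| {
            match lang_profile.iter().position(|t| t == trigram) {
                Some(lang_rank) => rank.abs_diff(lang_rank),
                None => MAX_PENALTY,
            }
        })
        .sum()
}

/// Returns the name of the language profile closest to `text`,
/// or `None` if no profiles are given.
fn detect<'a>(text: &str, languages: &'a [(&'a str, Vec<String>)]) -> Option<&'a str> {
    let profile = trigram_profile(text);
    languages
        .iter()
        .min_by_key(|(_, lang_profile)| distance(&profile, lang_profile))
        .map(|(name, _)| *name)
}
```

In this sketch, a language profile is just the output of `trigram_profile` run over sample text for that language; a real library would ship precomputed trigram lists instead.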

At the beginning I thought I'd do it all at once: bang, and I have 160 languages! But it turned out that every language needs special treatment: there are different scripts, and sometimes one language may be written in two different scripts (e.g. Latin and Cyrillic). I also look up the language code and some text samples for the tests on Wikipedia.
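To illustrate why scripts matter, here is a rough sketch of telling Latin text from Cyrillic text by counting characters in coarse Unicode ranges. It is a simplified assumption-laden example, not whatlang's implementation: the `Script` enum, `script_of`, and `detect_script` are hypothetical names, and only two scripts are handled.

```rust
/// The two scripts handled by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Script {
    Latin,
    Cyrillic,
}

/// Classifies a single character using coarse Unicode ranges.
/// The Latin range U+00C0..=U+024F also contains the × and ÷ signs,
/// which a more careful implementation would exclude.
fn script_of(c: char) -> Option<Script> {
    match c {
        'a'..='z' | 'A'..='Z' | '\u{00C0}'..='\u{024F}' => Some(Script::Latin),
        '\u{0400}'..='\u{04FF}' => Some(Script::Cyrillic),
        _ => None,
    }
}

/// Returns the script with the most characters in `text`,
/// or `None` if the text contains no Latin or Cyrillic characters.
/// On a tie, Latin is returned.
fn detect_script(text: &str) -> Option<Script> {
    let (mut latin, mut cyrillic) = (0usize, 0usize);
    for c in text.chars() {
        match script_of(c) {
            Some(Script::Latin) => latin += 1,
            Some(Script::Cyrillic) => cyrillic += 1,
            None => {}
        }
    }
    if latin == 0 && cyrillic == 0 {
        None
    } else if latin >= cyrillic {
        Some(Script::Latin)
    } else {
        Some(Script::Cyrillic)
    }
}
```

With something like this, the same Serbian greeting written as "Dobar dan" and as "Добар дан" is assigned to different scripts, so a language written in both needs a trigram list per script.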

However, the dataset could be useful. Is it public? How can I access it? :)

greyblake commented 7 years ago

Here is the list of trigrams: https://github.com/wooorm/franc/blob/master/lib/data.json

greyblake commented 7 years ago

There is actually nothing to do here, so I am closing the issue.