halfdan closed this issue 7 years ago
Hi, @halfdan, thanks for offering the help!
I've seen Tatoeba before (maybe because it's a well-known project among Esperantists :)).
Whatlang is inspired by the Franc JS project: it applies basically the same algorithm, and I use the trigrams from that library. I haven't integrated all the possible languages yet.
At the beginning I thought I'd do it in one go: bang! and I have 160 languages. But it turned out that every language needs special treatment: there are different scripts, and sometimes one language may be written in two different scripts (e.g. Latin and Cyrillic). I also look up the language code and some text samples for the tests on Wikipedia.
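To make "the same algorithm" a bit more concrete, here is a minimal Python sketch of trigram-rank comparison, a common approach for this kind of detector. It is not Whatlang's or Franc's actual code: the function names, the profile size of 300, and the penalty for unseen trigrams are all assumptions made for illustration.

```python
from collections import Counter


def trigrams(text: str) -> list[str]:
    """Return overlapping character trigrams of a lowercased, space-padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def ranked_profile(text: str, size: int = 300) -> list[str]:
    """Most frequent trigrams first, truncated to `size` entries (size is an assumption)."""
    counts = Counter(trigrams(text))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ordered[:size]]


def distance(text_profile: list[str], lang_profile: list[str]) -> int:
    """Sum of rank differences; trigrams missing from the language get a fixed penalty."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)  # assumed penalty for an unseen trigram
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(text_profile)
    )


def detect(text: str, profiles: dict[str, list[str]]) -> str:
    """Return the language code whose profile is closest to the text."""
    text_profile = ranked_profile(text)
    return min(profiles, key=lambda lang: distance(text_profile, profiles[lang]))
```

In a real library the per-language profiles would come from a large corpus rather than from a single sentence, and the text would be normalized more carefully before counting.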
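The script problem can be illustrated with a small Python sketch that guesses the dominant script of a text from Unicode code point ranges. The range table is a deliberately tiny, illustrative subset, and the names `SCRIPT_RANGES`, `script_of` and `detect_script` are hypothetical, not part of any library.

```python
from collections import Counter
from typing import Optional

# Illustrative subset of Unicode ranges; a real detector needs many more.
# The Latin range 0x00C0-0x024F also contains a couple of symbols
# (e.g. the multiplication sign), which this sketch ignores.
SCRIPT_RANGES = {
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Arabic": [(0x0600, 0x06FF)],
}


def script_of(char: str) -> Optional[str]:
    """Return the script name for one character, or None if it is not in the table."""
    code = ord(char)
    for script, ranges in SCRIPT_RANGES.items():
        if any(low <= code <= high for low, high in ranges):
            return script
    return None


def detect_script(text: str) -> Optional[str]:
    """Return the script with the most characters in `text`, or None if none match."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else None


# The same Serbian greeting in two scripts gives two different answers.
print(detect_script("Добар дан"))  # Cyrillic
print(detect_script("Dobar dan"))  # Latin
```

One possible use of such a step is to narrow down the candidates first, so that trigram profiles are compared only for languages written in the detected script.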
However, the dataset can be useful. Is it public? How can I access it? :)
Here is the list of trigrams: https://github.com/wooorm/franc/blob/master/lib/data.json
There is actually nothing to do here, so I am closing the issue.
I'm a recent contributor to the Tatoeba project and I'm currently working on finding a better language detection library. Given the data collected at Tatoeba, it is easy to provide a set of trigrams for about 200 languages (I already have them in a database).
Let me know if you're interested in trying to integrate some more languages by using this dataset. It'd be worth trying to use it and compare it against other libraries / on the original dataset from Tatoeba (~5M sentences).
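A comparison like this could be scripted roughly as follows. This is only a sketch: it assumes the sentences are available as tab-separated rows of `id`, `lang`, `text`, and that each library can be wrapped in a function taking a sentence and returning a language code. Both `read_samples` and `accuracy_by_language` are hypothetical helper names.

```python
import csv
from collections import defaultdict
from typing import Callable, Iterable, Iterator, TextIO


def read_samples(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (language_code, sentence) pairs from rows of: id, lang, text (assumed layout)."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 3:
            yield row[1], row[2]


def accuracy_by_language(
    samples: Iterable[tuple[str, str]],
    detect: Callable[[str], str],
) -> dict[str, float]:
    """Return the fraction of correctly detected sentences for each language."""
    total: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for lang, sentence in samples:
        total[lang] += 1
        if detect(sentence) == lang:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}
```

Running the same samples through each detector and comparing the per-language numbers would show where a larger trigram set actually helps.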