Recommended way to get good n-grams for any language

greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/

MIT License

Recommended way to get good n-grams for any language #55

Closed. darccio closed this issue 4 years ago

darccio commented 4 years ago

I would like to add support for my mother tongue, Catalan, to whatlang-rs. If I'm able to do that, I can use it in sonic for an idea I'm testing.

I've reviewed PR #53 and the related issue #52, and adding a language seems simple. However, I searched the Internet for Catalan n-grams, found some, and tried them in a fork. Unfortunately, the best result I got had a confidence of 0.76. I even tried generating my own n-gram set from a large corpus, softcatala/ca-text-corpus.

So, my question: where can I get a good n-gram set? Where did the other languages' sets come from?
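
For reference, deriving a trigram set from a corpus, as described above, could look roughly like the sketch below. It uses only the Rust standard library. The function name `top_trigrams` is hypothetical and not part of whatlang-rs, and the sketch does not attempt to reproduce how Franc built its tables.

```rust
use std::collections::HashMap;

/// Counts character trigrams in `text` and returns the `top` most frequent,
/// most frequent first. Ties are broken alphabetically so the order is stable.
/// Hypothetical helper for illustration; not a whatlang-rs API.
fn top_trigrams(text: &str, top: usize) -> Vec<(String, usize)> {
    // Work on chars, not bytes, so accented letters count as one symbol.
    let chars: Vec<char> = text.to_lowercase().chars().collect();

    let mut counts: HashMap<String, usize> = HashMap::new();
    for window in chars.windows(3) {
        let trigram: String = window.iter().collect();
        *counts.entry(trigram).or_insert(0) += 1;
    }

    let mut ranked: Vec<(String, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(top);
    ranked
}
```

Calling `top_trigrams(&corpus_text, 300)` on a large text would give a ranked list that can be compared against an existing table. The cutoff of 300 is an arbitrary choice for illustration.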

greyblake commented 4 years ago

Hi @imdario.

The file with the trigram base is here: https://github.com/greyblake/whatlang-rs/blob/master/misc/data.json#L62 (Catalan is also there). Whatlang inherited its trigrams from Franc.

I don't know Catalan, but I assume it is very similar to Spanish in many ways. Detection may not work well on short texts, where it is hard to distinguish Catalan from Spanish with sufficient confidence.

If you'd like me to add Catalan support to Whatlang, just let me know and I'll do it.

darccio commented 4 years ago

@greyblake I missed this file! Thank you, I'll implement it myself.

As for the texts, I tried some that are impossible in Spanish, since Catalan uses apostrophes, slashes, grave accents, etc., but had no luck. I will report back after trying the trigrams you linked.

greyblake commented 4 years ago

Apostrophes and slashes are not used for trigrams (nor are other punctuation marks or digits).
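
To illustrate the point above, here is a minimal sketch of a pre-processing step that drops punctuation and digits before trigrams are extracted. It assumes non-alphabetic characters are replaced by a single space. That is one plausible reading, and whatlang's actual implementation may differ. The function name `normalize` is hypothetical.

```rust
/// Lowercases alphabetic characters and replaces everything else
/// (punctuation, digits, whitespace) with a single space.
/// Hypothetical helper for illustration; not a whatlang-rs API.
fn normalize(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        if c.is_alphabetic() {
            // Accented letters such as 'à' are alphabetic and are kept.
            out.extend(c.to_lowercase());
        } else if !out.is_empty() && !out.ends_with(' ') {
            // Collapse any run of non-letters into one space.
            out.push(' ');
        }
    }
    out.trim_end().to_string()
}
```

Under this assumption, `normalize("L'àvia")` and `normalize("l àvia")` produce the same string, so the apostrophe adds no signal of its own, while the grave accent survives and still contributes to the trigrams.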

darccio commented 4 years ago

It works! I'll close this issue and open a PR.