Closed darccio closed 4 years ago
Hi @imdario.
The file with the trigram data is here: https://github.com/greyblake/whatlang-rs/blob/master/misc/data.json#L62 (Catalan is also there). Whatlang inherited its trigrams from Franc.
I don't know Catalan, but I assume it is very similar to Spanish in many ways. Detection may not work well on short texts, where it can be hard to distinguish Catalan from Spanish with sufficient confidence.
If you'd like me to add Catalan support to Whatlang, just let me know and I'll do it.
@greyblake I missed this file! Thank you, I'll implement it myself.
As for the texts, I tried some that would be impossible in Spanish, since Catalan uses apostrophes, slashes, grave accents, etc., but no luck. I'll report back after trying the trigrams you linked.
Apostrophes and slashes are not used for trigrams, and neither are other punctuation marks or digits.
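To illustrate the point, here is a minimal sketch of trigram counting that ignores punctuation and digits. It is not Whatlang's actual code: the function names `count_trigrams` and `top_trigrams` are hypothetical, and treating every non-letter as a word separator (rather than dropping it) is an assumption made for the example.

```rust
use std::collections::HashMap;

/// Counts character trigrams in `text`.
/// Assumption for this sketch: every non-alphabetic character
/// (apostrophes, slashes, digits, other punctuation) becomes a space.
fn count_trigrams(text: &str) -> HashMap<String, usize> {
    // Start with a leading space so word-initial trigrams are captured.
    let mut cleaned: Vec<char> = vec![' '];
    for c in text.chars().flat_map(|c| c.to_lowercase()) {
        let c = if c.is_alphabetic() { c } else { ' ' };
        // Collapse runs of separators into a single space.
        if c == ' ' && cleaned.last() == Some(&' ') {
            continue;
        }
        cleaned.push(c);
    }
    // Add a trailing space so word-final trigrams are captured.
    if cleaned.last() != Some(&' ') {
        cleaned.push(' ');
    }

    let mut counts = HashMap::new();
    for window in cleaned.windows(3) {
        let trigram: String = window.iter().collect();
        *counts.entry(trigram).or_insert(0) += 1;
    }
    counts
}

/// Returns the `n` most frequent trigrams, most frequent first.
/// Ties are broken alphabetically so the result is deterministic.
fn top_trigrams(text: &str, n: usize) -> Vec<String> {
    let mut entries: Vec<(String, usize)> = count_trigrams(text).into_iter().collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    entries.into_iter().take(n).map(|(trigram, _)| trigram).collect()
}
```

With this cleaning step, `count_trigrams("l'aigua")` and `count_trigrams("l aigua")` give the same counts, which is why an apostrophe alone cannot help tell Catalan from Spanish. Accented letters such as `à` are alphabetic, so they are kept and still contribute to the trigrams.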
It works! I'll close this issue and open a PR.
I would like to add support for my mother tongue, Catalan, to whatlang-rs. If I'm able to do that, I can use it in sonic for an idea I'm testing.
I've reviewed PR #53 and the related issue #52, and it seems simple. I've scoured the Internet for Catalan n-grams, found some, and tried them in a fork. Unfortunately, the best result I got had a confidence of 0.76. I even tried to generate my own n-gram set from a big corpus like softcatala/ca-text-corpus.
So, my question: where can I get a good n-gram set? Where did the other languages' sets come from?