Closed AndrewDryga closed 4 years ago
Hi @AndrewDryga, thanks a lot! I'll review and merge as soon as I can.
@minibikini we noticed that detection is not accurate for short messages. I'm now thinking about how to improve that:
```elixir
Paasaa.list_language_probabilities("Hello, how are you?", only: ["eng", "spa", "fra"], min_length: 4)
[{"fra", 1.0}, {"eng", 0.9246519246519247}, {"spa", 0.6404586404586405}]
```
```elixir
Paasaa.list_language_probabilities("Hello, how are you?", min_length: 4)
[
  {"sot", 1.0},
  {"fuf", 0.9826407154129405},
  {"nso", 0.8395581273014203},
  {"ita", 0.793792740662809},
  {"hat", 0.7743293003682272},
  {"fuv", 0.7727511835875855},
  {"ron", 0.7601262493424513},
  ...
]
```
I found a few papers and projects that use Twitter data for language detection, which is a much better fit for recognizing shorter messages:
The last one also makes stupid mistakes:
```elixir
iex(10)> Tongue.detect("Hola", subset)
[en: 0.9999974208734205]
iex(11)> Tongue.detect("Hello", subset)
[en: 0.99999532419546]
```
So adding a small dictionary and using it to update the weights may be beneficial, at least for short inputs.
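As a minimal sketch of that idea: a tiny dictionary of common short words could be used to boost the trigram-based scores after detection. Everything here (`ShortTextBoost`, the word list, the `0.5` boost weight) is a hypothetical illustration, not part of Paasaa's API:

```elixir
defmodule ShortTextBoost do
  # Hypothetical dictionary of very common short words; a real one
  # would be much larger and probably generated from corpus data.
  @dictionary %{
    "hello" => "eng",
    "hola" => "spa",
    "bonjour" => "fra"
  }

  # Assumed boost weight; would need tuning against real data.
  @boost 0.5

  # Takes trigram-based scores like [{"eng", 0.92}, {"spa", 0.64}] and
  # bumps languages whose dictionary words appear in the input text.
  def boost(text, scores) do
    words =
      text
      |> String.downcase()
      |> String.split(~r/[^\p{L}]+/u, trim: true)

    hits =
      words
      |> Enum.map(&Map.get(@dictionary, &1))
      |> Enum.reject(&is_nil/1)
      |> MapSet.new()

    scores
    |> Enum.map(fn {lang, score} ->
      if MapSet.member?(hits, lang), do: {lang, score + @boost}, else: {lang, score}
    end)
    |> Enum.sort_by(fn {_lang, score} -> score end, :desc)
  end
end
```

With this, `ShortTextBoost.boost("Hola", [{"eng", 0.99}, {"spa", 0.64}])` would rank `"spa"` first, which is exactly the kind of short-message case where pure trigram statistics fail.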
@AndrewDryga, I'm going to close this for now; feel free to reopen when you have time to continue.
Sorry for the late reply; work shifted focus. I'm not 100% sure we can make a PR that would make Paasaa good enough on our data, as I described in the comment above, but we're going to look into it soon. Since it may end up being something very different, we might just use our modified fork.
Hello @minibikini,
Thank you for the amazing work on this library!
I spent a few hours and overhauled it: updated the script language expressions and trigrams, cleaned up the tests (removed the magic and ported a more complete test suite from Franc), resurrected the benchmarks, made it look more Elixir-ish, and added some features for our common use cases.
I understand that the change is huge and maybe not something you want to maintain, so closing the PR is totally fine. So... would you like to merge it or not? :)
Closes #10.