Massive refactoring - Githubissues

AndrewDryga commented 5 years ago

Hello @minibikini,

Thank you for the amazing work on this library!

I spent a few hours overhauling it: updated the script/language expressions and trigrams, cleaned up the tests (removed the magic and switched to a more complete test suite ported from Franc), resurrected the benchmarks, made the code look more Elixir-ish, and added some features for our common use cases.

I understand that the change is huge and maybe not something you want to maintain, so closing the PR is totally fine. So, would you like to merge it or not? :)

Closes #10.

coveralls commented 5 years ago

Coverage increased (+9.09%) to 100.0% when pulling a56da27c38eb85c8f9409ae9b59f15c59a106f7e on AndrewDryga:master into babf1eb8f60e1804b91765788b2f79d9430a7bd5 on minibikini:master.

minibikini commented 5 years ago

Hi @AndrewDryga, thanks a lot! I'll review and merge as soon as I can.

AndrewDryga commented 5 years ago

@minibikini we noticed that detection is not accurate for short messages, so I'm now thinking about how to improve that:

Paasaa.list_language_probabilities("Hello, how are you?", only: ["eng", "spa", "fra"], min_length: 4)
[{"fra", 1.0}, {"eng", 0.9246519246519247}, {"spa", 0.6404586404586405}]

Paasaa.list_language_probabilities("Hello, how are you?", min_length: 4)
[
  {"sot", 1.0},
  {"fuf", 0.9826407154129405},
  {"nso", 0.8395581273014203},
  {"ita", 0.793792740662809},
  {"hat", 0.7743293003682272},
  {"fuv", 0.7727511835875855},
  {"ron", 0.7601262493424513},
  ...
]

I found a few papers and projects that use Twitter data for language detection (which is a much better fit for recognizing short messages):

https://github.com/shuyo/language-detection
https://github.com/shuyo/ldig
https://arxiv.org/pdf/1608.08515.pdf
https://github.com/dachev/node-cld
https://github.com/google/cld3 - additionally uses unigrams and bigrams, with a neural network trained on top of them;
https://github.com/dannote/tongue

The last one also makes stupid mistakes:

iex(10)> Tongue.detect("Hola", subset)
[en: 0.9999974208734205]
iex(11)> Tongue.detect("Hello", subset)
[en: 0.99999532419546]

So adding a small dictionary and using it to update the weights may be beneficial, at least for short inputs.
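
A minimal sketch of that idea, assuming the detector returns a list of `{language, score}` tuples like the output above. The `DictionaryBoost` module, its tiny word lists, and the `0.5` boost factor are all hypothetical and chosen only for illustration; none of this is part of Paasaa:

```elixir
defmodule DictionaryBoost do
  # Assumed boost added to a score when every word of the text is in the
  # language's dictionary; the value is arbitrary and would need tuning.
  @boost 0.5

  # Hypothetical hand-picked word lists; a real dictionary would be far larger.
  defp dictionary do
    %{
      "eng" => MapSet.new(~w(hello how are you)),
      "spa" => MapSet.new(~w(hola como estas)),
      "fra" => MapSet.new(~w(bonjour comment allez vous))
    }
  end

  # Takes trigram-based scores and the original text, adds a bonus to each
  # language proportional to the share of words found in its dictionary,
  # and returns the list sorted by the adjusted score, highest first.
  def rerank(scores, text) do
    words =
      text
      |> String.downcase()
      |> String.split(~r/[^\p{L}]+/u, trim: true)

    scores
    |> Enum.map(fn {lang, score} -> {lang, score + @boost * hit_ratio(lang, words)} end)
    |> Enum.sort_by(fn {_lang, score} -> score end, :desc)
  end

  # Fraction of the words that appear in the dictionary for `lang`.
  defp hit_ratio(_lang, []), do: 0.0

  defp hit_ratio(lang, words) do
    known = Map.get(dictionary(), lang, MapSet.new())
    Enum.count(words, &MapSet.member?(known, &1)) / length(words)
  end
end
```

With the three-language scores from the first example, `DictionaryBoost.rerank(scores, "Hello, how are you?")` moves `"eng"` ahead of `"fra"`, because all four words are in the hypothetical English list and none are in the French one. Languages without a dictionary entry keep their original score.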

minibikini commented 4 years ago

@AndrewDryga, I'm going to close it for now, feel free to reopen when you have time to continue.

AndrewDryga commented 4 years ago

Sorry for the late reply, our focus at work shifted. I'm not 100% sure that we can make a PR that would make Paasaa good enough on our data, as I described in the comment above, but we are going to look into it soon. Because it may end up being something very different, we might just use our modified fork.

minibikini / paasaa

Massive refactoring #11