jprante / elasticsearch-langdetect

A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector
Apache License 2.0

Improve classification accuracy and coverage by merging profiles #69

Closed by yanirs 7 years ago

yanirs commented 7 years ago

Motivation: This set of changes improves our understanding of how the plugin performs on texts of various lengths and types, and increases the number of supported languages while improving classification accuracy.

Main changes

Summary of experiments

The following table presents the mean accuracy by dataset, profile, text length (full versus short, i.e., texts of length 5, 10 & 20), and language setting. As a reminder, the original number of default languages is 45, while the original-default, merged-average, and short-text profiles support 53, 55, and 47 languages respectively. Therefore, when testing on all languages, two numbers are reported for the original-default and short-text profiles: the first is the mean accuracy across all 55 languages (including languages these profiles can't get right because they don't support them), and the second is the mean accuracy across only the supported languages (marked with S:). I think that the first number is more in line with the goals of many plugin users, who would want to increase coverage but can't guarantee that they'd only try to classify texts in supported languages. The second number is provided for completeness.

| Dataset | Profile | Full texts; all 55 languages | Full texts; original-default 45 languages | Short texts; all 55 languages | Short texts; original-default 45 languages |
|---|---|---|---|---|---|
| udhr | original-default | 96.36% (S: 100%) | 100% | 75.11% (S: 77.94%) | 81.11% |
| udhr | merged-average | 100% | 100% | 77.76% | 81.39% |
| udhr | short-text | 85.45% (S: 100%) | 100% | 68.32% (S: 79.95%) | 80.41% |
| wordpress-translations | original-default | 95.45% (S: 99.06%) | 98.93% | 69.55% (S: 72.17%) | 74.95% |
| wordpress-translations | merged-average | 99.60% | 99.69% | 73.43% | 77.25% |
| wordpress-translations | short-text | 85.16% (S: 99.66%) | 99.64% | 65.01% (S: 76.08%) | 76.55% |
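
To make the two reported numbers concrete, here is a minimal Python sketch of how an "all languages" figure and an "S:" figure can differ for the same profile. It assumes the mean is taken over per-language accuracies and that unsupported languages score 0; the function name `mean_accuracy` and the placeholder language codes are hypothetical and are not taken from the plugin or the PR's evaluation code.

```python
def mean_accuracy(per_language_accuracy, languages):
    """Mean of per-language accuracies over `languages`.

    Languages missing from `per_language_accuracy` count as 0.0,
    which models a profile that cannot detect them at all.
    """
    return sum(per_language_accuracy.get(lang, 0.0) for lang in languages) / len(languages)


# Hypothetical setup: 55 test languages, and a profile that supports 53 of them
# and classifies every text in a supported language correctly.
all_languages = [f"lang{i:02d}" for i in range(55)]
supported = all_languages[:53]
per_language_accuracy = {lang: 1.0 for lang in supported}

overall = mean_accuracy(per_language_accuracy, all_languages)
supported_only = mean_accuracy(per_language_accuracy, supported)

print(f"{overall:.2%} (S: {supported_only:.2%})")  # prints "96.36% (S: 100.00%)"
```

Under these assumptions, a profile that is perfect on 53 of 55 languages tops out at 53/55 ≈ 96.36%, and one that is perfect on 47 of 55 tops out at 47/55 ≈ 85.45%. These match the udhr full-text figures reported for the original-default and short-text profiles.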

Manual testing

yanirs commented 7 years ago

@jprante Any thoughts on this PR? I'm happy to provide more details if necessary. :slightly_smiling_face:

jprante commented 7 years ago

@yanirs sorry for the delay. That's simply marvelous work, I'm very impressed. Thank you for sharing your improvements with me and the community!

I have to go into the details for myself in a quiet hour, but I'm confident from your excellent pull request that everything will work perfectly.

yanirs commented 7 years ago

@jprante Thank you! :smile:

jprante commented 7 years ago

I have released version 5.4.0.2 with the pull request included.