Closed: yanirs closed this 7 years ago
@jprante Any thoughts on this PR? I'm happy to provide more details if necessary. :slightly_smiling_face:
@yanirs Sorry for the delay. That's simply marvelous work, I'm very impressed. Thank you for sharing your improvements with me and the community!
I still have to go through the details myself in a quiet hour, but I'm confident from your excellent pull request that everything will work perfectly.
@jprante Thank you! :smile:
I have released version 5.4.0.2 with the pull request included.
Motivation: This set of changes improves our understanding of how the plugin performs on texts of various lengths and types, and increases the number of supported languages while improving classification accuracy.
Main changes
- Tests for `LangdetectService` that evaluate its classification accuracy (percentage of correctly-classified texts) on various text lengths and types. These tests are instantiated dynamically by the `DetectLanguageAccuracyTest` class using a CSV file (`src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/accuracies.csv`), which contains a row for each set of parameters with the expected classification accuracy for each language. The tested parameters include:
  - Dataset: UDHR texts (`src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/udhr.tsv`), and translations of the WordPress interface (`src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/wordpress-translations.tsv`).
  - Profile: `original-default`, `short-text`, or the newly-added `merged-average` (more on this below).
  - Language setting: `original-default` or `all`.
- See `scripts/README.md` for running instructions.
- Support for updating `DetectLanguageAccuracyTest`'s CSV input file: when the `path.accuracies.out` system property is set, the test class writes the accuracies to a CSV file, making it easy to update the expected results if they ever change.
- `merged-average`, a new language profile that combines the `original-default` and `short-text` profiles by averaging the n-gram frequencies for every language. This language profile supports 55 languages (the union of the 53 languages supported by the `original-default` profile and the 47 languages supported by the `short-text` profile), while increasing classification accuracy on the 45 original-default languages (the intersection of the two existing profiles). A comparison of the performance of the different profiles is shown below. Given this comparison, the default settings have been changed to use the `merged-average` profile on all 55 languages.
- Changes to `LangProfile` to make it immutable and allow it to read `long` integers.

Summary of experiments
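As background for the comparison, here is a minimal sketch of what averaging the n-gram frequencies of two profiles could look like. It is not the plugin's code: the class `MergedProfileSketch`, its methods, and the language codes `aa`, `bb`, `cc` are hypothetical. It also makes two assumptions: an n-gram missing from one profile counts as frequency 0 there, and a language present in only one profile is copied over unchanged.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical illustration only; this is not the plugin's LangProfile API. */
public class MergedProfileSketch {

    /** Averages two n-gram -> frequency maps that describe the same language. */
    public static Map<String, Double> averageFrequencies(
            Map<String, Double> a, Map<String, Double> b) {
        Set<String> ngrams = new HashSet<>(a.keySet());
        ngrams.addAll(b.keySet());
        Map<String, Double> merged = new TreeMap<>();
        for (String ngram : ngrams) {
            // Assumption: an n-gram absent from one profile has frequency 0 there.
            double fa = a.getOrDefault(ngram, 0.0);
            double fb = b.getOrDefault(ngram, 0.0);
            merged.put(ngram, (fa + fb) / 2.0);
        }
        return merged;
    }

    /** Merges two profiles, each mapping a language code to its n-gram frequencies. */
    public static Map<String, Map<String, Double>> mergeProfiles(
            Map<String, Map<String, Double>> first,
            Map<String, Map<String, Double>> second) {
        // The merged profile covers the union of both language sets.
        Set<String> languages = new HashSet<>(first.keySet());
        languages.addAll(second.keySet());
        Map<String, Map<String, Double>> merged = new TreeMap<>();
        for (String lang : languages) {
            if (first.containsKey(lang) && second.containsKey(lang)) {
                merged.put(lang, averageFrequencies(first.get(lang), second.get(lang)));
            } else {
                // Assumption: a language found in only one profile is copied as-is.
                Map<String, Double> source =
                        first.containsKey(lang) ? first.get(lang) : second.get(lang);
                merged.put(lang, new TreeMap<>(source));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up toy data; real profiles hold many more n-grams per language.
        Map<String, Map<String, Double>> first = Map.of(
                "aa", Map.of("ab", 0.4, "bc", 0.6),
                "bb", Map.of("xy", 1.0));
        Map<String, Map<String, Double>> second = Map.of(
                "aa", Map.of("ab", 0.2, "cd", 0.8),
                "cc", Map.of("zz", 1.0));
        System.out.println(mergeProfiles(first, second));
    }
}
```

With these toy inputs the merged profile covers three languages (the union), and only `aa`, which appears in both inputs, has its frequencies averaged.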
The following table presents the mean accuracy by dataset, profile, text length (full versus short – texts of length 5, 10 & 20), and language setting. As a reminder, the original number of default languages is 45, while the `original-default`, `merged-average`, and `short-text` profiles support 53, 55, and 47 languages respectively. Therefore, when testing on all languages, two numbers are reported for the `original-default` and `short-text` profiles: the first is the mean accuracy across all 55 languages (including languages they can't get right due to lack of support), and the second is the mean accuracy across only the supported languages (marked with S:). I think that the first number is more in line with the goals of many plugin users, who would want to increase coverage, but can't guarantee that they'd only try to classify texts in supported languages. The second number is provided for completeness.

[Table: only the supported-language (S:) values survive, in their original order] (S: 100%), (S: 77.94%), (S: 100%), (S: 79.95%), (S: 99.06%), (S: 72.17%), (S: 99.66%), (S: 76.08%)
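To make the difference between the two reported numbers concrete, here is a small sketch of how they could be computed. It assumes an unsupported language scores an accuracy of 0, in line with "languages they can't get right due to lack of support". The class `MeanAccuracySketch`, its methods, and the toy values are hypothetical, not taken from the plugin or from the experiments.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the two means; not the plugin's test code. */
public class MeanAccuracySketch {

    /** Mean over every tested language; languages the profile lacks score 0. */
    public static double meanOverAll(
            Map<String, Double> accuracyBySupportedLanguage, List<String> allLanguages) {
        double sum = 0.0;
        for (String lang : allLanguages) {
            sum += accuracyBySupportedLanguage.getOrDefault(lang, 0.0);
        }
        return sum / allLanguages.size();
    }

    /** Mean over only the languages the profile supports (the "S:" number). */
    public static double meanOverSupported(Map<String, Double> accuracyBySupportedLanguage) {
        double sum = 0.0;
        for (double accuracy : accuracyBySupportedLanguage.values()) {
            sum += accuracy;
        }
        return sum / accuracyBySupportedLanguage.size();
    }

    public static void main(String[] args) {
        // Made-up values: the profile supports two of the four tested languages.
        Map<String, Double> accuracies = Map.of("aa", 1.0, "bb", 0.9);
        List<String> tested = List.of("aa", "bb", "cc", "dd");
        System.out.println("all languages:  " + meanOverAll(accuracies, tested));
        System.out.println("supported only: " + meanOverSupported(accuracies));
    }
}
```

The first mean is always at most the second, because every unsupported language adds a zero to the numerator while still counting in the denominator.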
Manual testing
- Run `./gradlew test --rerun-tasks --info` to view the output of the new tests (all tests should pass).
- Run `./gradlew test --rerun-tasks -Dpath.accuracies.out=accuracies.csv`, and verify that the output `accuracies.csv` file is identical to `src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/accuracies.csv`.
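The property-driven output and the "identical file" check could look roughly like the sketch below. Only the property name `path.accuracies.out` comes from this pull request. The class `AccuraciesOutputSketch`, its methods, and the row format are hypothetical, and the real test class may be organised quite differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not the actual DetectLanguageAccuracyTest code. */
public class AccuraciesOutputSketch {

    static final String OUT_PROPERTY = "path.accuracies.out";

    /** Writes the rows to the file named by the system property; returns false if it is unset. */
    public static boolean writeIfRequested(List<String> csvRows) throws IOException {
        String outPath = System.getProperty(OUT_PROPERTY);
        if (outPath == null || outPath.isEmpty()) {
            return false; // property not set, so nothing is written
        }
        Files.write(Paths.get(outPath), csvRows, StandardCharsets.UTF_8);
        return true;
    }

    /** Byte-for-byte comparison of two files. */
    public static boolean sameContent(Path a, Path b) throws IOException {
        return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java AccuraciesOutputSketch <generated.csv> <expected.csv>
        if (args.length == 2) {
            boolean same = sameContent(Paths.get(args[0]), Paths.get(args[1]));
            System.out.println(same ? "identical" : "different");
        }
    }
}
```

For a one-off manual check, any byte-level file comparison tool gives the same answer. The sketch only shows that the comparison is easy to script.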