Improve classification accuracy and coverage by merging profiles

yanirs commented 7 years ago

Motivation: This set of changes improves our understanding of how the plugin performs on texts of various lengths and types, and increases the number of supported languages while improving classification accuracy.

Main changes

Add tests for LangdetectService that evaluate its classification accuracy (percentage of correctly-classified texts) on various text lengths and types. These tests are instantiated dynamically by the DetectLanguageAccuracyTest class using a CSV file (src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/accuracies.csv), which contains a row for each set of parameters with the expected classification accuracy for each language. The tested parameters include:
- Text length, with emphasis on short texts in the 5-20 characters range (simulating search queries).
- Text type, represented by two new datasets that cover all the languages supported by the plugin: translations of the Universal Declaration of Human Rights (src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/udhr.tsv), and translations of the WordPress interface (src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/wordpress-translations.tsv).
- Language profile: original-default, short-text, or the newly-added merged-average (more on this below).
- Supported languages: original-default or all.
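To make the text-length parameter concrete, here is a minimal Python sketch of how an accuracy figure for one length could be computed. It is not the DetectLanguageAccuracyTest implementation: `accuracy_at_length`, the prefix-based truncation, and the toy classifier are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional, Tuple


def accuracy_at_length(
    samples: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
    length: Optional[int] = None,
) -> float:
    """Return the fraction of (language, text) samples classified correctly.

    Assumption: short texts are simulated by keeping the first `length`
    characters; `None` means the full text is used.
    """
    total = 0
    correct = 0
    for expected_lang, text in samples:
        snippet = text if length is None else text[:length]
        total += 1
        if classify(snippet) == expected_lang:
            correct += 1
    return correct / total if total else 0.0


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for a real detector: keys on a single character."""
    return "de" if "ß" in text else "en"


samples = [
    ("en", "All human beings are born free and equal"),
    ("de", "Alle Menschen sind frei und gleich, heißt es"),
]
full_accuracy = accuracy_at_length(samples, toy_classify)
short_accuracy = accuracy_at_length(samples, toy_classify, length=5)
```

With the toy classifier, the full texts are both classified correctly, while the 5-character German prefix no longer contains the distinguishing character, which mirrors why short texts are the harder case.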
Add the Python code used to generate the datasets. See scripts/README.md for running instructions.
Add an option to regenerate DetectLanguageAccuracyTest's CSV input file: When the path.accuracies.out system property is set, the test class writes the accuracies to a CSV file, making it easy to update the expected results if they ever change.
Add merged-average, a new language profile that combines the original-default and short-text profiles by averaging the n-gram frequencies for every language. This language profile supports 55 languages (the union of the 53 languages supported by the original-default profile and the 47 languages supported by the short-text profile), while increasing classification accuracy on the 45 original-default languages (the intersection of the two existing profiles). A comparison of the performance of the different profiles is shown below. Given this comparison, the default settings have been changed to use the merged-average profile on all 55 languages.
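The averaging idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in this PR: profiles are modelled as plain dictionaries of relative n-gram frequencies, `merge_average` is a hypothetical name, and a language present in only one profile is assumed to be copied unchanged.

```python
from typing import Dict

# language code -> n-gram -> relative frequency (assumed representation)
Profile = Dict[str, Dict[str, float]]


def merge_average(first: Profile, second: Profile) -> Profile:
    """Merge two profiles by averaging n-gram frequencies per language."""
    merged: Profile = {}
    for lang in sorted(set(first) | set(second)):
        a = first.get(lang)
        b = second.get(lang)
        if a is None or b is None:
            # Assumption: a language supported by only one profile is kept as-is.
            merged[lang] = dict(a if a is not None else b)
            continue
        # An n-gram missing from one profile counts as frequency 0 there.
        merged[lang] = {
            ngram: (a.get(ngram, 0.0) + b.get(ngram, 0.0)) / 2
            for ngram in set(a) | set(b)
        }
    return merged


original_default = {"en": {"th": 0.5, "he": 0.5}, "fr": {"le": 1.0}}
short_text = {"en": {"th": 0.25, "an": 0.75}, "cy": {"dd": 1.0}}
merged = merge_average(original_default, short_text)
# merged["en"] == {"th": 0.375, "he": 0.25, "an": 0.375}
```

The merged profile covers the union of the input languages, which matches the language counts above: 53 + 47 - 45 = 55.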
Refactor LangProfile to make it immutable and allow it to read long integers.
Normalise Romanian and Vietnamese characters, as done by Shuyo's original library. According to tests on the UDHR and WordPress translations datasets, this improves classification accuracy on these languages, and doesn't affect performance on other languages. Original code for reference:
- Romanian: code, test
- Vietnamese: code, test
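As a rough Python illustration of what this kind of normalisation does (not a port of the Java code): the Romanian mapping direction below (comma-below letters to their cedilla forms) is an assumption that should be checked against the linked original code, and the Vietnamese step uses Unicode NFC composition only as an approximation of the library's own composition table.

```python
import unicodedata

# Assumed mapping: s/t with comma below -> s/t with cedilla.
ROMANIAN_MAP = str.maketrans({"\u0219": "\u015f", "\u021b": "\u0163"})


def normalize_romanian(text: str) -> str:
    """Collapse the two Unicode spellings of Romanian s/t into one."""
    return text.translate(ROMANIAN_MAP)


def normalize_vietnamese(text: str) -> str:
    """Approximation: compose base letter + combining mark into one code point."""
    return unicodedata.normalize("NFC", text)


romanian = normalize_romanian("\u0219i \u021bar\u0103")
vietnamese = normalize_vietnamese("a\u0301")  # 'a' + combining acute accent
```

In both cases the goal is that visually identical text always produces the same n-grams, regardless of how the input happened to be encoded.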

Summary of experiments

The following table presents the mean accuracy by dataset, profile, text length (full versus short, i.e. texts of length 5, 10, and 20 characters), and language setting. As a reminder, the original number of default languages is 45, while the original-default, merged-average, and short-text profiles support 53, 55, and 47 languages respectively. Therefore, when testing on all languages, two numbers are reported for the original-default and short-text profiles: the first is the mean accuracy across all 55 languages (including languages they can't get right due to lack of support), and the second is the mean accuracy across only the supported languages (marked with S:). I think that the first number is more in line with the goals of many plugin users, who would want to increase coverage but can't guarantee that they'd only try to classify texts in supported languages. The second number is provided for completeness.

Dataset	Profile	Full texts; all 55 languages	Full texts; original-default 45 languages	Short texts; all 55 languages	Short texts; original-default 45 languages
udhr	original-default	96.36% (S: 100%)	100%	75.11% (S: 77.94%)	81.11%
udhr	merged-average	100%	100%	77.76%	81.39%
udhr	short-text	85.45% (S: 100%)	100%	68.32% (S: 79.95%)	80.41%
wordpress-translations	original-default	95.45% (S: 99.06%)	98.93%	69.55% (S: 72.17%)	74.95%
wordpress-translations	merged-average	99.60%	99.69%	73.43%	77.25%
wordpress-translations	short-text	85.16% (S: 99.66%)	99.64%	65.01% (S: 76.08%)	76.55%
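The relationship between the two numbers in the "all 55 languages" columns of the table above can be checked with a few lines of Python. The helper name `mean_over_all` is hypothetical; the only assumption is that an unsupported language contributes 0% accuracy, as described in the text.

```python
def mean_over_all(supported_mean: float, supported_count: int, total_count: int = 55) -> float:
    """Mean accuracy over all languages, with unsupported languages scoring 0%."""
    return supported_mean * supported_count / total_count


# udhr, full texts: S: 100% on the supported languages of each profile.
udhr_full_original_default = round(mean_over_all(100.0, 53), 2)  # 96.36
udhr_full_short_text = round(mean_over_all(100.0, 47), 2)  # 85.45
```

The same scaling (53/55 for original-default, 47/55 for short-text) reproduces the other paired values in the table, up to rounding.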

Manual testing

Run ./gradlew test --rerun-tasks --info to view the output of the new tests (all tests should pass).
Run ./gradlew test --rerun-tasks -Dpath.accuracies.out=accuracies.csv, and verify that the output accuracies.csv file is identical to src/test/resources/org/xbib/elasticsearch/index/mapper/langdetect/accuracies.csv.

yanirs commented 7 years ago

@jprante Any thoughts on this PR? I'm happy to provide more details if necessary. :slightly_smiling_face:

jprante commented 7 years ago

@yanirs sorry for the delay. That's simply marvelous work, I'm very impressed. Thank you for sharing your improvements with me and the community!

I still have to go through the details myself in a quiet hour, but I'm confident from your excellent pull request that everything will work perfectly.

yanirs commented 7 years ago

@jprante Thank you! :smile:

jprante commented 7 years ago

I have released version 5.4.0.2 with the pull request included.

jprante / elasticsearch-langdetect