optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

Japanese detection is not good #86

Open afjlee opened 6 years ago

afjlee commented 6 years ago

"案ソリレコ藤将ヤ崎47問カオヲヱ埼7関幅えめラな免像たふッ模合むなイレ量版はラほ県組リ右具ヱニエ代切ぜ。向テスラカ鑑表ハ評業でが身味うレ女経チコモ高送福渋うぴンラ疑3金すけやざ一芸ユ北社セキネ南過ロ固高トさ事奥ぽに。嗅末必はぼを仕技ウテヱナ誌文ル余給にー企32杯ば中農ぐ展演ワユソ藤見れじスご上止氏援ソトハ健宿よ" is detected as Chinese-TW. "おはようございます" is detected as no match.

Can detection combine charset range with n-gram algorithm?

james-s-w-clark commented 4 years ago

Similar to #85, just checking Lingua's performance here (#107):

detected = {Language@8430} "CHINESE"
 isoCode639_1 = {IsoCode639_1@10047} "zh"
 isoCode639_3 = {IsoCode639_3@10048} "zho"
 alphabets = {Collections$SingletonSet@10049}  size = 1
 uniqueCharacters = ""
 name = "CHINESE"

@pemistahl Lingua's description notes that it checks text for unique characters that only exist in one script, before deciding to use ngram models.

I see that the JAPANESE language enum has no unique chars (the last field in the enum below):

    JAPANESE(JA, JPN, setOf(HIRAGANA, KATAKANA, HAN), ""),

However, I think HIRAGANA and KATAKANA are unique to Japanese - so shouldn't this text be picked up as Japanese (without needing the n-gram models at all)?
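
As a rough caller-side illustration of that idea (my own sketch, not Lingua's or Optimaize's internals): hiragana and katakana occur only in Japanese, so if either script is present the language can be decided before any n-gram model is consulted - which would also cover the original "おはようございます" example.

    // Illustrative pre-check only: the kana scripts are unique to Japanese, so their
    // presence can short-circuit detection before any n-gram model runs.
    static boolean containsKana(String text) {
        return text.codePoints().anyMatch(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            return script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA;
        });
    }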

pemistahl commented 4 years ago

@IdiosApps I've fixed the Japanese detection problem. Maybe you want to try it again.

james-s-w-clark commented 4 years ago

@pemistahl Lingua's readme still shows the version as 0.6.1 - has a new build been pushed?

pemistahl commented 4 years ago

@IdiosApps No, you have to build from source yourself if you want to test the changes.

james-s-w-clark commented 4 years ago

Found a huge accuracy improvement (and a bug) in Optimaize, concerning how input text is normalized.

I'll show example code and a result first, then explain why it works.

    // Imports from language-detector (CharNormalizer's exact package varies by version):
    import com.optimaize.langdetect.DetectedLanguage;
    import com.optimaize.langdetect.LanguageDetector;
    import com.optimaize.langdetect.LanguageDetectorBuilder;
    import com.optimaize.langdetect.ngram.NgramExtractors;
    import com.optimaize.langdetect.profiles.LanguageProfile;
    import com.optimaize.langdetect.profiles.LanguageProfileReader;
    import java.util.List;

    String jaText = "コンコルド001試作機は1969年3月2日にトゥールーズで初飛行した";

    // Normalise each character the same way profile creation does
    StringBuilder jaTextNormalised = new StringBuilder();
    jaText.chars()
            .mapToObj(c -> (char) c)
            .map(CharNormalizer::normalize)
            .forEach(jaTextNormalised::append);

    // readAllBuiltIn() throws IOException
    List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
    LanguageDetector detector = LanguageDetectorBuilder.create(NgramExtractors.standard())
            .withProfiles(languageProfiles)
            .build();

    List<DetectedLanguage> detectedLanguages = detector.getProbabilities(jaText);
    List<DetectedLanguage> detectedLanguagesNormalised = detector.getProbabilities(jaTextNormalised);

This gives results:

detectedLanguages = {ArrayList@1331}  size = 2
 0 = {DetectedLanguage@1341} "DetectedLanguage[zh-TW:0.8481627026465732]"
 1 = {DetectedLanguage@1342} "DetectedLanguage[ja:0.14799680901578538]"
detectedLanguagesNormalised = {ArrayList@1332}  size = 1
 0 = {DetectedLanguage@1336} "DetectedLanguage[ja:1.0]"

Look at that jump from 15% to absolute confidence!


I noticed that the Japanese profile only had 1 katakana ア and 1 hiragana あ (out of about 20-30, some of which should be very frequent). The file was also quite short, which was unexpected as there are several thousand kanji to make ngrams with.

So, assuming the model was bad, I tried to create a new model for Japanese - and noticed that it was munging my text:

コンコルド001試作機は1969年3月2日にトゥールーズで初飛行した
->
アアアアア   並三並あ    年 月 日あアアアアアアあ之並三ああ

I stepped through in the debugger and got to CharNormalizer.normalize(). Here you can see the compression: e.g. all katakana collapse to ア, and blocks of kanji collapse to the first kanji in the block. Presumably this is to keep the models from getting too large (114,000 n-grams for 70 languages - it could be a lot bigger if all CJK characters were used).
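
For intuition, here is a rough sketch of the kind of mapping involved - my own illustration only, not the library's actual CharNormalizer source, which uses hand-built lookup tables with different buckets:

    // Illustrative only: approximates the compression described above.
    static char normalizeSketch(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        if (block == Character.UnicodeBlock.HIRAGANA) {
            return 'あ';   // every hiragana collapses to a single representative
        }
        if (block == Character.UnicodeBlock.KATAKANA) {
            return 'ア';   // likewise for katakana
        }
        if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            // Kanji are grouped into ranges, each represented by its first character;
            // snapping to a coarse 256-codepoint bucket mimics that idea.
            return (char) (c & 0xFF00);
        }
        if (Character.isDigit(c)) {
            return ' ';    // digits appear to become spaces in the munged output above
        }
        return c;
    }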

I moved on from creating a language profile to testing language detection. Detection adjusts the overall language probabilities based on the per-language probability of each n-gram in the input. However, our un-normalized detection text matched very few entries in the 114,000-entry n-gram map, because those un-normalized n-grams were never added during model creation (due to the compression). If you manually normalize the input (e.g. as in my example code above), the detector actually finds the (normalized) entries in the n-gram map and can adjust the language probabilities properly.

I considered (1) creating a new ja profile (very large?), and (2) creating an Optimaize build which normalizes input properly (this is a fixable bug - but the project seems stale and PRs are ignored; I'm also not familiar with pushing releases to Maven/Gradle and all that)... but I think the simplest solution is to manually normalise the input.

Manually normalizing input should increase CJK accuracy in general. I doubt it will decrease accuracy - we are just formatting our input according to the same rules the models were built with.
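
If you go that route, a small helper keeps call sites tidy - a minimal sketch assuming the same imports as the snippet above (the method name is my own):

    // Hypothetical helper: normalise input with the same CharNormalizer used at
    // profile-creation time, then delegate to the normal detector.
    static List<DetectedLanguage> detectNormalised(LanguageDetector detector, String text) {
        StringBuilder normalised = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            normalised.append(CharNormalizer.normalize(text.charAt(i)));
        }
        return detector.getProbabilities(normalised);   // StringBuilder is a CharSequence
    }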

james-s-w-clark commented 4 years ago

@afjlee for your input text, here are my Optimaize results (using the code snippet above):

detectedLanguages = {ArrayList@1322}  size = 1
 0 = {DetectedLanguage@1327} "DetectedLanguage[zh-TW:0.5219696306608081]"
detectedLanguagesNormalised = {ArrayList@1323}  size = 1
 0 = {DetectedLanguage@1330} "DetectedLanguage[ja:0.9999996289775959]"

afjlee commented 4 years ago

thanks All!!

james-s-w-clark commented 4 years ago

Actually, I tested on some strings from #63, and normalizing detection input according to the profile-creation normalization appears to have a large negative impact on accuracy in some cases.

Need to debug more