optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

TextObjectFactory changes text #97

Open danielnaber opened 5 years ago

danielnaber commented 5 years ago

This test fails:

    TextObjectFactory textObjectFactory = new TextObjectFactoryBuilder().maxTextLength(1000).build();
    String inp = "一体日本人は生きるということを知っているだろうか。";
    String shortText = textObjectFactory.forText(inp).toString();
    assertEquals(inp, shortText);

Output:

    org.junit.ComparisonFailure:
    Expected :一体日本人は生きるということを知っているだろうか。
    Actual   :一万日三人あ三ああああああああ之ああああああああ。

I guess the issue is in TextObject.append().

danielnaber commented 4 years ago

I don't understand why com.optimaize.langdetect.cybozu.util.CharNormalizer#normalize does this:

        } else if (block == Character.UnicodeBlock.HIRAGANA) {
            ch = '\u3042';

i.e. all characters of a Unicode block are mapped to a single character?
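
To make the effect concrete, here is a minimal, self-contained sketch of just that one branch. The class and method names (`HiraganaCollapseSketch`, `collapse`) are hypothetical and this is not the library's code; it only reproduces the hiragana rule quoted above and leaves every other character untouched. The kanji substitutions in the failing test (for example 体 becoming 万) presumably come from other branches, which this sketch does not attempt to model.

```java
public final class HiraganaCollapseSketch {

    // Assumption: only the HIRAGANA branch from the snippet above is modelled.
    static char collapse(char ch) {
        if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.HIRAGANA) {
            return '\u3042'; // every hiragana character becomes あ
        }
        return ch; // all other characters pass through unchanged
    }

    // Applies the per-character rule to a whole string.
    static String collapse(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(collapse(text.charAt(i)));
        }
        return sb.toString();
    }
}
```

With this rule alone, `collapse("だろうか")` yields `"ああああ"`, which matches the runs of あ in the actual output of the test.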

james-s-w-clark commented 4 years ago

Hi @danielnaber, please take a look at https://github.com/optimaize/language-detector/issues/86#issuecomment-638818158

Basically: I think it's because hiragana and katakana are unique to Japanese (and similarly Hangul symbols are unique to Korean, etc.), so collapsing each block to a single character is a way to compress the models. I expect that the "compressed" models perform similarly to full models, but this is a guess that's not backed by data!

It seems that the action to take is to normalise your input text yourself before detection, so that the munged text actually finds matches in the big n-gram map.
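
As a rough illustration of why the input has to be munged the same way as the model, here is a small self-contained sketch. Everything in it is an assumption for illustration: `NgramLookupSketch`, its methods, and the tiny bigram set standing in for the "big n-gram map" are hypothetical, and the normalisation covers only the hiragana rule discussed earlier in this thread.

```java
import java.util.HashSet;
import java.util.Set;

public final class NgramLookupSketch {

    // Assumption: only the hiragana rule is applied; other blocks are left as-is.
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            boolean hiragana = Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.HIRAGANA;
            sb.append(hiragana ? '\u3042' : ch);
        }
        return sb.toString();
    }

    // Collects the distinct 2-character n-grams of a string.
    static Set<String> bigrams(String text) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    // Counts how many distinct bigrams of the text are present in the model.
    static int hits(Set<String> model, String text) {
        int count = 0;
        for (String gram : bigrams(text)) {
            if (model.contains(gram)) {
                count++;
            }
        }
        return count;
    }
}
```

If the stand-in model is built from normalised text, for example `bigrams(normalize("これはほん"))`, then `hits(model, "だろうか")` finds nothing, while `hits(model, normalize("だろうか"))` finds the shared bigram of two あ characters. Keeping the original string in a separate variable avoids losing it to the normalisation.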