Open danielnaber opened 5 years ago
I don't understand why com.optimaize.langdetect.cybozu.util.CharNormalizer#normalize does this:
} else if (block == Character.UnicodeBlock.HIRAGANA) {
ch = '\u3042';
i.e. why are all characters of a Unicode block mapped to a single character?
Hi @danielnaber, please take a look at https://github.com/optimaize/language-detector/issues/86#issuecomment-638818158
Basically: I think it's because hiragana/katakana are unique to Japanese (and similarly Hangul symbols are unique to Korean, etc.), so the mapping is an attempt to compress the models. I expect that the "compressed" models perform similarly to full models, but this is a guess that's not backed by data!
It seems that the action to take is to manually normalise your input text for detection (so the munged text actually finds matches in the big ngram map).
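To make the idea above concrete, here is a minimal sketch of block collapsing applied to an input string before detection. It is not the library's code: the class `BlockCollapseSketch` and its methods `collapse` and `collapseAll` are hypothetical names made up for illustration. Only the hiragana mapping to `'\u3042'` comes from the snippet quoted in this issue; the katakana and Hangul representatives are assumptions chosen to show the same pattern.

```java
// Hypothetical illustration only; this is NOT CharNormalizer from the library.
final class BlockCollapseSketch {

    // Map every character of certain Unicode blocks to one representative
    // character, so that all n-grams from that script look the same.
    static char collapse(char ch) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(ch);
        if (block == Character.UnicodeBlock.HIRAGANA) {
            return '\u3042'; // representative taken from the snippet in this issue
        } else if (block == Character.UnicodeBlock.KATAKANA) {
            return '\u30a2'; // assumed representative, for illustration
        } else if (block == Character.UnicodeBlock.HANGUL_SYLLABLES) {
            return '\uac00'; // assumed representative, for illustration
        }
        return ch; // everything else passes through unchanged in this sketch
    }

    // Apply the per-character mapping to a whole input text.
    static String collapseAll(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(collapse(text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Five hiragana characters followed by Latin text.
        String input = "\u3053\u3093\u306b\u3061\u306f abc";
        System.out.println(collapseAll(input));
    }
}
```

If the models were built from text collapsed this way, then un-collapsed input would presumably produce n-grams that never appear in the n-gram map, which would explain why the input has to be normalised the same way before detection.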
This test fails:
Output:
I guess the issue is in TextObject.append().