Open mikemccand opened 5 years ago
I wonder why you think that this is an issue. Punctuation is removed by default, so this is only an issue if you want to use the Korean number filter?
[Legacy Jira: Jim Ferenczi (@jimczi) on Sep 13 2019]
Sorry for the late reply, @jimczi :(
First, I'll change this issue's type from Bug to Improvement, because it is debatable whether this is really a bug.
> I wonder why you think that this is an issue. Punctuation is removed by default, so this is only an issue if you want to use the Korean number filter?
As you said, the main use case is KoreanNumberFilter. However, users can also simply set the discardPunctuation option of KoreanTokenizer to false, without using KoreanNumberFilter:
    Analyzer myAnalyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        // The last argument is discardPunctuation; false keeps punctuation tokens.
        Tokenizer tokenizer = new KoreanTokenizer(newAttributeFactory(), userDictionary, DecompoundMode.NONE, false, false);
        return new TokenStreamComponents(tokenizer, tokenizer);
      }
    };
When discardPunctuation is set to false, users may find the following result strange (at least I do):

    Input   : ...사이즈...
    Expect1 : [.][..][사이즈][.][..]
    Expect2 : [...][사이즈][...]
    Result  : [...][사이즈][.][..]
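
To make the difference between Result and Expect2 concrete, here is a minimal plain-Java sketch that merges adjacent punctuation-only tokens. It is not part of Lucene: the class `PunctuationRunMerger`, its methods, and the simplified punctuation check based on Unicode categories are all hypothetical, and KoreanTokenizer's own notion of punctuation may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene class.
class PunctuationRunMerger {

    // Assumption: a code point is punctuation if it falls in a Unicode
    // punctuation category. KoreanTokenizer may use a broader definition.
    static boolean isPunctuationCodePoint(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // A token counts as punctuation only if it is non-empty and every
    // code point in it is punctuation.
    static boolean isPunctuation(String token) {
        return !token.isEmpty()
            && token.codePoints().allMatch(PunctuationRunMerger::isPunctuationCodePoint);
    }

    // Concatenates each run of adjacent punctuation tokens into one token,
    // leaving all other tokens untouched.
    static List<String> mergePunctuationRuns(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        for (String token : tokens) {
            int last = merged.size() - 1;
            if (last >= 0 && isPunctuation(token) && isPunctuation(merged.get(last))) {
                merged.set(last, merged.get(last) + token);
            } else {
                merged.add(token);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // The "Result" token sequence from the example above.
        List<String> result = List.of("...", "사이즈", ".", "..");
        System.out.println(mergePunctuationRuns(result)); // [..., 사이즈, ...]
    }
}
```

In a real TokenFilter the offsets and position increments of the merged tokens would also have to be adjusted, which this sketch leaves out.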
What do you think about this?
[Legacy Jira: Namgyu Kim (@danmuzi) on Sep 18 2019]
As we discussed in LUCENE-8966, when the input contains consecutive punctuation marks, KoreanTokenizer currently always splits them into the first mark and the rest (사이즈.... => [사이즈] [.] [...]). However, it does not split them when the input starts with punctuation (...사이즈 => [...] [사이즈]).
This looks like a result of the Viterbi path, but users may find the following case strange: ("사이즈" means "size" in Korean)
From what I checked, Nori has punctuation characters (such as . and ,) in its dictionary, but Kuromoji does not. ("サイズ" means "size" in Japanese)
There are ways to resolve this, such as hard-coding the handling of punctuation, but that does not seem like a good approach. So I think we need to discuss it.
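
For discussion, one possible reading of "hard-coding the handling of punctuation" is to pre-segment the raw text into punctuation and non-punctuation runs before any dictionary lookup, so that leading and trailing runs are treated the same way. The sketch below is plain Java and purely illustrative: the class `PunctuationPreSegmenter`, the method `splitRuns`, and the hard-coded character set are assumptions, not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-segmentation step for illustration only; not a Lucene class.
class PunctuationPreSegmenter {

    // Assumption: a deliberately tiny hard-coded punctuation set.
    private static final String HARD_CODED_PUNCTUATION = ".,!?";

    static boolean isPunctuation(int cp) {
        return HARD_CODED_PUNCTUATION.indexOf(cp) >= 0;
    }

    // Splits text into maximal runs that are either all punctuation or all
    // non-punctuation, regardless of where in the text the run appears.
    static List<String> splitRuns(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean currentIsPunct = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean punct = isPunctuation(cp);
            // Close the current run whenever the character class changes.
            if (current.length() > 0 && punct != currentIsPunct) {
                runs.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            currentIsPunct = punct;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(splitRuns("...사이즈...")); // [..., 사이즈, ...]
        System.out.println(splitRuns("사이즈....")); // [사이즈, ....]
    }
}
```

Under this approach only the non-punctuation runs would go through the Viterbi segmentation. The obvious drawback is the one raised above: the punctuation set is fixed in code instead of coming from the dictionary.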
Legacy Jira details
LUCENE-8977 by Namgyu Kim (@danmuzi) on Sep 11 2019, updated Sep 18 2019