Handle punctuation characters in KoreanTokenizer [LUCENE-8977]

mikemccand commented 5 years ago

As we discussed on LUCENE-8966, when the input contains consecutive punctuation marks, KoreanTokenizer currently always splits them into the first mark and the rest (사이즈.... => [사이즈] [.] [...]). However, it does not split them when the punctuation comes at the start of the input (...사이즈 => [...] [사이즈]).

This looks like a consequence of the Viterbi path, but users may find cases like the following odd ("사이즈" means "size" in Korean):

| | Case #1 | Case #2 |
|---|---|---|
| Input | "...사이즈..." | "...4......4사이즈" |
| Result | [...] [사이즈] [.] [..] | [...] [4] [.] [.....] [4] [사이즈] |

From what I checked, Nori has punctuation characters (such as . and ,) in its dictionary, but Kuromoji does not. Kuromoji gives the following results ("サイズ" means "size" in Japanese):

| | Case #1 | Case #2 |
|---|---|---|
| Input | "...サイズ..." | "...4......4サイズ" |
| Result | [...] [サイズ] [...] | [...] [4] [......] [4] [サイズ] |

There are ways to resolve this, such as hard-coding special handling for punctuation, but that does not seem like a good approach. So I think we need to discuss it.
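For illustration only, the kind of hard-coded handling mentioned above might look roughly like the sketch below, which groups consecutive punctuation characters into a single run before any dictionary lookup. `PunctuationRuns` and `splitRuns` are hypothetical names and not part of Nori, and treating the Unicode P* general categories as "punctuation" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationRuns {
  // Assumption: "punctuation" means the Unicode P* general categories.
  static boolean isPunctuation(int codePoint) {
    switch (Character.getType(codePoint)) {
      case Character.CONNECTOR_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  // Splits text into maximal runs that are either all punctuation or all non-punctuation.
  static List<String> splitRuns(String text) {
    List<String> runs = new ArrayList<>();
    int start = 0;
    int i = 0;
    while (i < text.length()) {
      int cp = text.codePointAt(i);
      int next = i + Character.charCount(cp);
      boolean atEnd = next >= text.length();
      // Close the current run at the end of the text or where the character class changes.
      if (atEnd || isPunctuation(text.codePointAt(next)) != isPunctuation(cp)) {
        runs.add(text.substring(start, next));
        start = next;
      }
      i = next;
    }
    return runs;
  }
}
```

This only separates punctuation runs from everything else; the non-punctuation runs would still need to go through the normal dictionary-based segmentation, which is one reason hard-coding feels unsatisfying.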

Legacy Jira details

LUCENE-8977 by Namgyu Kim (@danmuzi) on Sep 11 2019, updated Sep 18 2019

mikemccand commented 5 years ago

I wonder why you think that this is an issue. Punctuation is removed by default, so this is only an issue if you want to use the Korean number filter?

[Legacy Jira: Jim Ferenczi (@jimczi) on Sep 13 2019]

mikemccand commented 5 years ago

Sorry for the late reply, @jimczi :(

First, I'll change this issue from Bug to Improvement, because it is debatable whether it is a bug.

> I wonder why you think that this is an issue. Punctuation is removed by default, so this is only an issue if you want to use the Korean number filter?

As you said, the main use case is KoreanNumberFilter. However, users can also simply set the discardPunctuation option of KoreanTokenizer to false, without using KoreanNumberFilter:

```java
Analyzer myAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // The last two arguments are outputUnknownUnigrams = false and discardPunctuation = false.
    Tokenizer tokenizer = new KoreanTokenizer(newAttributeFactory(), userDictionary, DecompoundMode.NONE, false, false);
    return new TokenStreamComponents(tokenizer, tokenizer);
  }
};
```

With discardPunctuation set to false, users may find the following result strange (at least I do):

Input : ...사이즈...

Expect1 : [.][..][사이즈][.][..]

Expect2 : [...][사이즈][...]

Result : [...][사이즈][.][..]

What do you think about this?
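To make Expect2 concrete, here is a minimal sketch of a post-processing step that merges neighbouring punctuation-only tokens. It is not how KoreanTokenizer works; `PunctuationMerger` and `mergePunctuation` are hypothetical names, it operates on plain strings rather than a TokenStream, and it assumes a token is punctuation when it matches the regex `\p{P}+`. It also ignores offsets, so it would merge tokens that were separated by whitespace in the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PunctuationMerger {
  // Assumption: a punctuation token consists only of Unicode punctuation characters.
  private static final Pattern PUNCTUATION = Pattern.compile("\\p{P}+");

  static boolean isPunctuationToken(String token) {
    return PUNCTUATION.matcher(token).matches();
  }

  // Concatenates each punctuation token onto the previous token when that one is punctuation too.
  static List<String> mergePunctuation(List<String> tokens) {
    List<String> merged = new ArrayList<>();
    for (String token : tokens) {
      int last = merged.size() - 1;
      if (last >= 0 && isPunctuationToken(token) && isPunctuationToken(merged.get(last))) {
        merged.set(last, merged.get(last) + token);
      } else {
        merged.add(token);
      }
    }
    return merged;
  }
}
```

Applied to the Result above, `mergePunctuation` turns [...][사이즈][.][..] into [...][사이즈][...], which is the Expect2 form.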

[Legacy Jira: Namgyu Kim (@danmuzi) on Sep 18 2019]


Handle punctuation characters in KoreanTokenizer [LUCENE-8977] #974
