
Nori (Korean) tokenizer removes the decimal point [LUCENE-8784] #782

Closed (mikemccand closed this issue 5 years ago)

mikemccand commented 5 years ago

This is the same issue that I mentioned in https://github.com/elastic/elasticsearch/issues/41401#event-2293189367.

Unlike the standard analyzer, the Nori analyzer removes the decimal point.

The Nori tokenizer removes the "." character by default, which makes it difficult to index keywords that contain a decimal point.

It would be nice to have an option controlling whether the decimal point is kept. Like the Japanese tokenizer, Nori needs an option to preserve it.
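For illustration, a minimal sketch of the difference (assuming Lucene's nori module on the classpath; the output shown in comments is the behavior being reported):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DecimalPointDemo {
  static void printTokens(Analyzer analyzer, String text) throws Exception {
    try (TokenStream ts = analyzer.tokenStream("field", text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.print("[" + term + "] ");
      }
      ts.end();
      System.out.println();
    }
  }

  public static void main(String[] args) throws Exception {
    // StandardAnalyzer keeps "10.1" as a single token; the Nori analyzer
    // drops the "." and splits the number into "10" and "1".
    printTokens(new StandardAnalyzer(), "version 10.1"); // [version] [10.1]
    printTokens(new KoreanAnalyzer(), "version 10.1");   // [version] [10] [1]
  }
}
```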


Legacy Jira details

LUCENE-8784 by Munkyu Im on Apr 30 2019, resolved May 27 2019. Attachments: LUCENE-8784.patch (versions: 4).

mikemccand commented 5 years ago

Hi @jimczi and @Munkyu, I uploaded a patch for this issue.

I only changed the Tokenizer and TokenizerFactory, not the Analyzer. In the Japanese analyzer it cannot be customized either (discardPunctuation is always true). If necessary, we can easily add it to the Analyzer.

However, I have a question. The current patch passes the flag as a parameter on every call to the isPunctuation method. If we stop making it a static method, we don't have to pass the parameter every time.

What do you think about making isPunctuation non-static?
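Roughly, the two shapes under discussion; this is a hypothetical sketch, not the actual patch (the class, method, and field names here are illustrative):

```java
public class PunctuationCheckSketch {
  private final boolean discardDecimalPoint;

  public PunctuationCheckSketch(boolean discardDecimalPoint) {
    this.discardDecimalPoint = discardDecimalPoint;
  }

  // Option 1 (current patch shape): static helper, so the flag has to be
  // threaded through every call site.
  static boolean isPunctuation(char ch, boolean discardDecimalPoint) {
    if (ch == '.' && !discardDecimalPoint) {
      return false; // keep the decimal point
    }
    switch (Character.getType(ch)) {
      case Character.OTHER_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  // Option 2 (proposed): instance method reading a field set in the
  // constructor, so call sites don't need the extra parameter.
  boolean isPunctuation(char ch) {
    return isPunctuation(ch, discardDecimalPoint);
  }
}
```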

[Legacy Jira: Namgyu Kim (@danmuzi) on May 21 2019]

mikemccand commented 5 years ago

Hi @danmuzi, I don't think we should have one option for every punctuation type, and the current check in the patch, based on Character.OTHER_PUNCTUATION, would match more than just the full stop character. If we want to preserve punctuation, we can add the same option as Kuromoji has (discardPunctuation) and output a token for each punctuation group. So for an input like "10.1?" we would output 4 tokens: "10", ".", "1", "?". Then, if you need to regroup tokens based on additional rules, you can add another filter to do this, like the JapaneseNumberFilter does. The other option would be to detect numbers with decimal points accurately like the standard tokenizer does, but we don't want to reinvent the wheel either. If we want the same grouping for unknown words in this tokenizer, we should probably implement it on top of the standard or ICU tokenizer directly.
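For illustration, a sketch of the proposed behavior; the five-argument KoreanTokenizer constructor used here follows the patch under discussion, so treat it as an assumption rather than a released API:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PunctuationGroupDemo {
  public static void main(String[] args) throws Exception {
    // With discardPunctuation=false, one token per punctuation group,
    // so "10.1?" should come out as [10] [.] [1] [?].
    Tokenizer tokenizer = new KoreanTokenizer(
        TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY,
        null,                                    // no user dictionary
        KoreanTokenizer.DecompoundMode.DISCARD,  // default decompound mode
        false,                                   // outputUnknownUnigrams
        false);                                  // discardPunctuation
    tokenizer.setReader(new StringReader("10.1?"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.print("[" + term + "] ");
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```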

[Legacy Jira: Jim Ferenczi (@jimczi) on May 22 2019]

mikemccand commented 5 years ago

Thank you for your reply, @jimczi :D

I tried to handle only the "." character in the Tokenizer, because Korean, unlike Japanese, is a language that can have whitespace inside a sentence. ("Character.OTHER_PUNCTUATION would match more than just the full stop character." => Right, that's a problem. I have to change that part...)

JapaneseTokenizer keeps whitespace when using the discardPunctuation option.

Of course we can do it with a StopFilter or internal processing in another Filter, but is that okay?

Developing a NumberFilter looks much more flexible and structurally cleaner than internal processing in the Tokenizer, but I developed it this way because of the problems above. How can we handle those spaces?

I think there are several ways to handle this problem:
1) Remove whitespace from the punctuation list in the Tokenizer.
2) Use a TokenFilter to remove whitespace.
3) Remove whitespace in KoreanNumberFilter. (looks structurally strange...)
4) Just leave the whitespace.

[Legacy Jira: Namgyu Kim (@danmuzi) on May 22 2019]

mikemccand commented 5 years ago

> JapaneseTokenizer keeps whitespace when using the discardPunctuation option.

I think we should do the same here; whitespace is part of the punctuation, so we should not remove it. It's a bit tricky because we remove whitespace early on in the Korean tokenizer. However, we keep the offset of the starting whitespace in each token that we generate, so it is possible to regenerate the whitespace when we create the final tokens.

I attached a patch that does this if discardPunctuation is set to false.
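As a rough illustration of that offset trick (hypothetical types and names, not the actual patch): compare each token's start offset with the end of the previous token; any gap must have been skipped whitespace.

```java
import java.util.List;

public class WhitespaceRegenSketch {
  // Minimal stand-in for a surface token with offsets into the original
  // text (illustrative only; the real tokenizer keeps this state internally).
  record Token(String text, int startOffset, int endOffset) {}

  // Re-emit whitespace runs by looking at offset gaps between tokens.
  static void emitWithWhitespace(String original, List<Token> tokens) {
    int lastEnd = 0;
    for (Token token : tokens) {
      if (token.startOffset() > lastEnd) {
        // The gap [lastEnd, startOffset) was skipped whitespace.
        System.out.println("[" + original.substring(lastEnd, token.startOffset()) + "] (SP)");
      }
      System.out.println("[" + token.text() + "]");
      lastEnd = token.endOffset();
    }
  }

  public static void main(String[] args) {
    String text = "10 .1";
    emitWithWhitespace(text, List.of(
        new Token("10", 0, 2), new Token(".", 3, 4), new Token("1", 4, 5)));
  }
}
```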

> Of course we can do it with a StopFilter or internal processing in another Filter, but is that okay?

I think so, yes. It is easy to remove the whitespace with the KoreanPartOfSpeechStopFilter; you just need to add the tag SP (which represents whitespace) and any of the tags that represent punctuation (SY, SSC, SSO, ...). So if you need to keep punctuation for another filter, you can create a chain like: KoreanTokenizer(discardPunc=false), KoreanNumberFilter, KoreanPartOfSpeechStopFilter(-SP, -SSC, ...)

This way the number filter can see these tokens, but they are never indexed (they are filtered out by the POS filter).
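Sketched as a custom Analyzer, under the assumption that this patch's discardPunctuation constructor argument and the follow-up KoreanNumberFilter (LUCENE-8812) are both available:

```java
import java.util.EnumSet;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanNumberFilter;
import org.apache.lucene.analysis.ko.KoreanPartOfSpeechStopFilter;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.ko.POS;

public class PreservePunctuationAnalyzer extends Analyzer {
  // Drop whitespace (SP) and punctuation tags, but only after the number
  // filter has had a chance to see them.
  private static final Set<POS.Tag> STOP_TAGS =
      EnumSet.of(POS.Tag.SP, POS.Tag.SSC, POS.Tag.SSO, POS.Tag.SC,
                 POS.Tag.SE, POS.Tag.SF, POS.Tag.SY);

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KoreanTokenizer(
        TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY,
        null,                                    // no user dictionary
        KoreanTokenizer.DecompoundMode.DISCARD,  // default decompound mode
        false,                                   // outputUnknownUnigrams
        false);                                  // discardPunctuation=false
    TokenStream stream = new KoreanNumberFilter(tokenizer);
    stream = new KoreanPartOfSpeechStopFilter(stream, STOP_TAGS);
    return new TokenStreamComponents(tokenizer, stream);
  }
}
```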

 

[Legacy Jira: Jim Ferenczi (@jimczi) on May 23 2019]

mikemccand commented 5 years ago

Thank you for your reply, @jimczi!

Your approach looks awesome. I developed KoreanNumberFilter by referring to JapaneseNumberFilter. Please check my patch :D (use "git apply --whitespace=fix LUCENE-8784.patch" because of a trailing whitespace error :()

I did not set KoreanNumberFilter as a default filter in KoreanAnalyzer. By the way, wouldn't it be better to keep the constructors that do not take the discardPunctuation parameter? (Otherwise existing Nori users will have to modify their code after upgrading.)

[Legacy Jira: Namgyu Kim (@danmuzi) on May 23 2019]

mikemccand commented 5 years ago

Oh, I forgot: I also added Javadoc for discardPunctuation in your patch (KoreanAnalyzer, KoreanTokenizerFactory).

[Legacy Jira: Namgyu Kim (@danmuzi) on May 23 2019]

mikemccand commented 5 years ago

> By the way, wouldn't it be better to keep the constructors that do not take the discardPunctuation parameter? (Otherwise existing Nori users will have to modify their code after upgrading.)

Yes, we should do that; otherwise it's a breaking change and we cannot push it to 8.x.
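For example, the existing four-argument constructor could stay as a delegating overload (sketch of a fragment inside KoreanTokenizer; assumes the five-argument form from the patch):

```java
// Sketch: the pre-existing four-argument constructor stays and delegates to
// the new one, preserving the old behavior (punctuation discarded).
public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary,
                       DecompoundMode mode, boolean outputUnknownUnigrams) {
  this(factory, userDictionary, mode, outputUnknownUnigrams,
       true /* discardPunctuation, matching the previous behavior */);
}
```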

> I also added Javadoc for discardPunctuation in your patch (KoreanAnalyzer, KoreanTokenizerFactory).

Thanks!

> I developed KoreanNumberFilter by referring to JapaneseNumberFilter. Please check my patch :D

The patch looks good, but we should iterate on this in a new issue. We try to do one feature at a time in a single issue, so let's add discardPunctuation in this one and open a new one as a follow-up to add the KoreanNumberFilter?

[Legacy Jira: Jim Ferenczi (@jimczi) on May 24 2019]

mikemccand commented 5 years ago

It's a good idea :D

I linked LUCENE-8784 (discardPunctuation) with LUCENE-8812 (KoreanNumberFilter); LUCENE-8784 should be applied first, then LUCENE-8812. Your suggestion made this issue cleaner.

In LUCENE-8784, I did not change the existing test cases and just added new ones for discardPunctuation. (The current constructors remain, to preserve the existing API.)

[Legacy Jira: Namgyu Kim (@danmuzi) on May 24 2019]

mikemccand commented 5 years ago

The last patch for this issue looks good to me. I'll test locally and merge if all tests pass. 

Thanks for opening LUCENE-8812, I'll take a look when this issue gets merged.

[Legacy Jira: Jim Ferenczi (@jimczi) on May 24 2019]

mikemccand commented 5 years ago

Thanks, @jimczi! :D If there is anything wrong, I'd appreciate it if you let me know.

[Legacy Jira: Namgyu Kim (@danmuzi) on May 24 2019]

mikemccand commented 5 years ago

Commit a556925eb8aaea61e55010a6e5c90520139b56c3 in lucene-solr's branch refs/heads/master from Namgyu Kim https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a556925

LUCENE-8784: The KoreanTokenizer now preserves punctuations if discardPunctuation is set to false (defaults to true).

Signed-off-by: Namgyu Kim <kng0828@gmail.com>
Signed-off-by: jimczi <jimczi@apache.org>

[Legacy Jira: ASF subversion and git services on May 27 2019]

mikemccand commented 5 years ago

Commit 990f62aca47fc0571c79e80f25ee1d0b7dc4a824 in lucene-solr's branch refs/heads/branch_8x from Namgyu Kim https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=990f62a

LUCENE-8784: The KoreanTokenizer now preserves punctuations if discardPunctuation is set to false (defaults to true).

Signed-off-by: Namgyu Kim <kng0828@gmail.com>
Signed-off-by: jimczi <jimczi@apache.org>

[Legacy Jira: ASF subversion and git services on May 27 2019]

mikemccand commented 5 years ago

Thanks @danmuzi! I pushed to master and branch_8x. I also removed the discardPunctuation option from the KoreanAnalyzer in order to be consistent with the JapaneseAnalyzer. It's an advanced option that should be used with a specific token filter in mind (KoreanNumberFilter, for instance).

[Legacy Jira: Jim Ferenczi (@jimczi) on May 27 2019]

mikemccand commented 5 years ago

Oh, I see that discardPunctuation was removed from KoreanAnalyzer.

Thank you very much for applying my patch! @jimczi :D

[Legacy Jira: Namgyu Kim (@danmuzi) on May 27 2019]

mikemccand commented 5 years ago

Commit db334c792bedb2b385f72098a6a61458a05667e3 in lucene-solr's branch refs/heads/master from Jim Ferenczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=db334c7

LUCENE-8784: Restore the Korean part-of-speech tag for NGRAM.

The part-of-speech tag for unigrams was changed inadvertently in a previous commit (not released). This change restores the original value that is also set on the serialized unknown dictionary.

[Legacy Jira: ASF subversion and git services on May 28 2019]

mikemccand commented 5 years ago

Commit bf0d6fad4294b1d0e7c17b819be0f730d4e59c38 in lucene-solr's branch refs/heads/branch_8x from Jim Ferenczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bf0d6fa

LUCENE-8784: Restore the Korean part-of-speech tag for NGRAM.

The part-of-speech tag for unigrams was changed inadvertently in a previous commit (not released). This change restores the original value that is also set on the serialized unknown dictionary.

[Legacy Jira: ASF subversion and git services on May 28 2019]

mikemccand commented 5 years ago

Closing after the 8.2.0 release.

[Legacy Jira: Ignacio Vera (@iverase) on Jul 26 2019]