add KoreanNumberFilter to Nori(Korean) Analyzer [LUCENE-8812]

mikemccand commented 5 years ago

This is a follow-up issue to LUCENE-8784.

The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to regular Arabic decimal numbers in half-width characters.

Logic is similar to JapaneseNumberFilter. It should be able to cover the following test cases.

1) Korean Word to Number 십만이천오백 => 102500

2) 1 character conversion 일영영영 => 1000

3) Decimal Point Calculation 3.2천 => 3200

4) Comma between three digits 4,647.0010 => 4647.001

Legacy Jira details

LUCENE-8812 by Namgyu Kim (@danmuzi) on May 24 2019, resolved Jun 12 2019 Attachments: LUCENE-8812.patch Linked issues:

LUCENE-8784

mikemccand commented 5 years ago

Hi @jimczi :D Thank you for applying the LUCENE-8784 patch! I checked that this patch can be applied into latest code. (no conflict)

What do you think about this patch?

[Legacy Jira: Namgyu Kim (@danmuzi) on May 29 2019]

mikemccand commented 5 years ago

The patch looks good @danmuzi, I wonder if it would be difficult to have a base class for the Japanese and Korean number filter since they share a large amount of code. However I think it's ok to merge this first and we can tackle the merge in a follow up, wdyt ?

[Legacy Jira: Jim Ferenczi (@jimczi) on May 30 2019]

mikemccand commented 5 years ago

Thank you for your reply, @jimczi :D

I wonder if it would be difficult to have a base class for the Japanese and Korean number filter since they share a large amount of code. However I think it's ok to merge this first and we can tackle the merge in a follow up, wdyt ?

I think it is an awesome refactoring. If the refactoring is done, we can also share this TokenFilter in SmartChineseAnalyzer. (Chinese and Japanese use the same numeric characters) The amount of code will also be reduced.

I think the NumberFilter (new abstract class) can be in the org.apache.lucene.analysis.core(analysis-common) or org.apache.lucene.analysis(lucene-core) package, what do you think? In my personal opinion, analysis-common seems to be correct, but it is a little bit ambiguous.

[Legacy Jira: Namgyu Kim (@danmuzi) on May 30 2019]

mikemccand commented 5 years ago

Sorry I didn't see your reply. I agree with you that it is ambiguous to put it in analysis-common so +1 to add it in the nori module for now and revisit if/when we create a separate module for the mecab tokenizer.

[Legacy Jira: Jim Ferenczi (@jimczi) on Jun 06 2019]

mikemccand commented 5 years ago

Thank you for your reply. @jimczi :D

Awesome. I'll submit this patch.

This is the first time that I submit a patch manually. I will look for the manual and proceed, but it can take a little time.

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 07 2019]

mikemccand commented 5 years ago

Commit 5a75b8a0808e577f9f1a737f411104034cc23a6b in lucene-solr's branch refs/heads/master from Namgyu Kim https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5a75b8a

LUCENE-8812: Add new KoreanNumberFilter that can change Hangul character to number and process decimal point

Signed-off-by: Namgyu Kim <namgyu@apache.org>

[Legacy Jira: ASF subversion and git services on Jun 09 2019]

mikemccand commented 5 years ago

Commit 36b90c6056932c9cf2e4dfae3d500cab5d03ba3b in lucene-solr's branch refs/heads/branch_8x from Namgyu Kim https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=36b90c6

LUCENE-8812: Add new KoreanNumberFilter that can change Hangul character to number and process decimal point.

Signed-off-by: Namgyu Kim <namgyu@apache.org>

[Legacy Jira: ASF subversion and git services on Jun 09 2019]

mikemccand commented 5 years ago

Hi @jimczi :D

I pushed my commit to branch_8x and master branch. I checked and it seems to be reflected fine. So I'll resolve this issue. Let me know if there are some problems.

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 09 2019]

mikemccand commented 5 years ago

Commit 619e47b791a3cea475923a61113adf0fe39404c6 in lucene-solr's branch refs/heads/branch_8x from Namgyu Kim https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=619e47b

LUCENE-8812: disable Java 9 try-with-resources style in TestKoreanNumberFilter

Signed-off-by: Namgyu Kim <namgyu@apache.org>

[Legacy Jira: ASF subversion and git services on Jun 09 2019]

mikemccand commented 5 years ago

Commit fe58b6f3a27e1520054f74f06c3d7c2042ff4f79 in lucene-solr's branch refs/heads/master from Namgyu Kim https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fe58b6f

LUCENE-8812: disable Java 9 try-with-resources style in TestKoreanNumberFilter

Signed-off-by: Namgyu Kim <namgyu@apache.org>

[Legacy Jira: ASF subversion and git services on Jun 09 2019]

mikemccand commented 5 years ago

I re-opened this issue because the error was occurred in branch_8x branch. (https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-Windows/298/) After the error was found, the commit to fix the error was pushed. The cause of the error was that when I wrote the test(TestKoreanNumberFilter), I used Java 9 try-with-resources style, and the error was occurred in branch_8x because it is based on Java 8. So I disable it including the master branch, and when the 9.0 version is officially released, I will rework it at that time.

And I made a slight mistake. I did not mention the issue number when committing to the branch_8x branch. I wanted to change the commit message, but it could not be changed because it is a protected branch. (force push is blocked) So I was forced to revert.

Anyway, both problems(error + wrong commit message) are all my fault, and I will be careful.

I will resolve this issue when the Jenkins build is completed.

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 09 2019]

mikemccand commented 5 years ago

Thanks @danmuzi!

[Legacy Jira: Jim Ferenczi (@jimczi) on Jun 10 2019]

mikemccand commented 5 years ago

You're welcome. @jimczi! I checked that the build was completed fine at Jenkins. I'll resolve this issue.

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 12 2019]

mikemccand commented 5 years ago

Closing after the 8.2.0 release

[Legacy Jira: Ignacio Vera (@iverase) on Jul 26 2019]

mikemccand / stargazers-migration-test

add KoreanNumberFilter to Nori(Korean) Analyzer [LUCENE-8812] #810

Legacy Jira details