KoreanTokenizer should split unknown words on digits [LUCENE-8966]

mikemccand commented 4 years ago

Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer groups characters of unknown words if they belong to the same script or an inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the rest in Latin) but this rule doesn't work well on digits since they are considered common with other scripts. For instance the input "44사이즈" is kept as is even though "사이즈" is part of the dictionary. We should restore the original behavior and splits any unknown words if a digit is followed by another type.

This issue was first discovered in https://github.com/elastic/elasticsearch/issues/46365

Legacy Jira details

LUCENE-8966 by Jim Ferenczi (@jimczi) on Sep 05 2019, resolved Sep 13 2019 Attachments: LUCENE-8966.patch (versions: 2)

mikemccand commented 4 years ago

Here is a patch that breaks unknown words on digits instead of grouping them with other types.

[Legacy Jira: Jim Ferenczi (@jimczi) on Sep 05 2019]

mikemccand commented 4 years ago

✔ +1 overall

Vote	Subsystem	Runtime	Comment
Prechecks
+1	test4tests	0m 0s	The patch appears to include 1 new or modified test files.
master Compile Tests
+1	compile	0m 40s	master passed
Patch Compile Tests
+1	compile	0m 37s	the patch passed
+1	javac	0m 36s	the patch passed
+1	Release audit (RAT)	0m 36s	the patch passed
+1	Check forbidden APIs	0m 36s	the patch passed
+1	Validate source patterns	0m 36s	the patch passed
Other Tests
+1	unit	0m 43s	nori in the patch passed.
	4m 35s

Subsystem	Report/Notes
JIRA Issue	LUCENE-8966
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12979538/LUCENE-8966.patch
Optional Tests	compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname	Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool	ant
Personality	/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision	master / 78b6530fb26
ant	version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019
Default Java	LTS
Test Results	https://builds.apache.org/job/PreCommit-LUCENE-Build/205/testReport/
modules	C: lucene/analysis/nori U: lucene/analysis/nori
Console output	https://builds.apache.org/job/PreCommit-LUCENE-Build/205/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Sep 05 2019]

mikemccand commented 4 years ago

isNumber() is dead code?

[Legacy Jira: Uwe Schindler (@uschindler) on Sep 05 2019]

mikemccand commented 4 years ago

Also you are using a private isDigit() at one place, the other place uses Character.isDigit(). This should be consistent.

[Legacy Jira: Uwe Schindler (@uschindler) on Sep 05 2019]

mikemccand commented 4 years ago

Thanks for looking @thetaphi. These two private static functions are dead codes that I forgot to remove. The other places use Character.isDigit() consistently.

[Legacy Jira: Jim Ferenczi (@jimczi) on Sep 05 2019]

mikemccand commented 4 years ago

New patch without the dead code

[Legacy Jira: Jim Ferenczi (@jimczi) on Sep 05 2019]

mikemccand commented 4 years ago

Would you consider grouping numbers and (at least some) punctuation together so that we can preserve decimals and fractions?

[Legacy Jira: Michael Sokolov (@msokolov) on Sep 05 2019]

mikemccand commented 4 years ago

Would you consider grouping numbers and (at least some) punctuation together so that we can preserve decimals and fractions?

For complex number grouping and normalization, @danmuzi added a KoreanNumberFilter in https://issues.apache.org/jira/browse/LUCENE-8812

It is identical to the JapaneseNumberFilter excepts that it only detects Korean hangul numbers. I don't think it handles fractions though but this could be added if needed.

[Legacy Jira: Jim Ferenczi (@jimczi) on Sep 05 2019]

mikemccand commented 4 years ago

✔ +1 overall

Vote	Subsystem	Runtime	Comment
Prechecks
+1	test4tests	0m 0s	The patch appears to include 1 new or modified test files.
master Compile Tests
+1	compile	0m 35s	master passed
Patch Compile Tests
+1	compile	0m 35s	the patch passed
+1	javac	0m 35s	the patch passed
+1	Release audit (RAT)	0m 35s	the patch passed
+1	Check forbidden APIs	0m 35s	the patch passed
+1	Validate source patterns	0m 35s	the patch passed
Other Tests
+1	unit	0m 38s	nori in the patch passed.
	4m 6s

Subsystem	Report/Notes
JIRA Issue	LUCENE-8966
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12979562/LUCENE-8966.patch
Optional Tests	compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname	Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool	ant
Personality	/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision	master / 78b6530fb26
ant	version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019
Default Java	LTS
Test Results	https://builds.apache.org/job/PreCommit-LUCENE-Build/206/testReport/
modules	C: lucene/analysis/nori U: lucene/analysis/nori
Console output	https://builds.apache.org/job/PreCommit-LUCENE-Build/206/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Sep 05 2019]

mikemccand commented 4 years ago

For complex number grouping and normalization, Namgyu Kim added a KoreanNumberFilter in https://issues.apache.org/jira/browse/LUCENE-8812

Ah thanks, I'll have a look

[Legacy Jira: Michael Sokolov (@msokolov) on Sep 06 2019]

mikemccand commented 4 years ago

Good job! @jimczi :D

It can be serious enough for Nori users.

About Punctuation, as @jimczi said, it can be remained by using discardPunctuation(set false) parameter in KoreanTokenizer.

You can test it by using analyzerWithPunctuation instance in TestKoreanTokenizer.

[Legacy Jira: Namgyu Kim (@danmuzi) on Sep 07 2019]

mikemccand commented 4 years ago

But there is a bug I just checked :(

Input : "4......4사이즈" Expected : [4] [......] [4] [사이즈] Actual : [4] [.] [.....] [4] [사이즈]

// Need to pass!
public void testDuplicatePunctuation() throws IOException {
  assertAnalyzesTo(analyzerWithPunctuation, "4......4사이즈",
      new String[]{"4", "......", "4", "사이즈"},
      new int[]{0, 1, 7, 8},
      new int[]{1, 7, 8, 11},
      new int[]{1, 1, 1, 1}
  );
}

I think we need to fix it. If it is okay to fix within this JIRA issue, I'll post additional patch. Otherwise I'll create a new one.

[Legacy Jira: Namgyu Kim (@danmuzi) on Sep 07 2019]

mikemccand commented 4 years ago

I don't think it's a bug @danmuzi or at least that it's related to this issue. In your example the first dot ('.' is a word dictionary) is considered a better path than grouping all dots eagerly. We process the unknown words greedily so we compare the path "[4], [.], [.....]" with "[4], [.], [.], [....]", "[4], [.], [.], [.], [...]", ... "[4], [......]". Keeping the first dot separated from the rest indicates that a number followed by a dot is a better splitting path than multiple dots in our model. We can discuss this behavior in a new issue if you think this should be configurable (for instance the JapaneseTokenizer process unknown words greedily only in search mode) ?

[Legacy Jira: Jim Ferenczi (@jimczi) on Sep 09 2019]

mikemccand commented 4 years ago

Oh, Thank you for your reply. @jimczi :D

I checked again and it was not bug. That result is come from viterbi path.

But I think it needs to be discussed. So I added a new issue about it.

I'd appreciate if you check LUCENE-8977.

P.S. +1 to your patch

[Legacy Jira: Namgyu Kim (@danmuzi) on Sep 11 2019]

mikemccand commented 4 years ago

Commit c8f36238ab0d99e2dc60b944030f90832b850737 in lucene-solr's branch refs/heads/master from jimczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c8f3623

LUCENE-8966: The Korean analyzer split tokens on boundaries between digits and alphabetic characters.

[Legacy Jira: ASF subversion and git services on Sep 13 2019]

mikemccand commented 4 years ago

Commit c4815f04c06256dbc9b28afdeb8b9c689198fd7d in lucene-solr's branch refs/heads/branch_8x from jimczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c4815f0

LUCENE-8966: The Korean analyzer now splits tokens on boundaries between digits and alphabetic characters.

[Legacy Jira: ASF subversion and git services on Sep 13 2019]

mikemccand commented 4 years ago

Commit ec1ef2bce652686416a6b1d9f57da7805fb1f93c in lucene-solr's branch refs/heads/master from jimczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ec1ef2b

LUCENE-8966: update CHANGES.txt after backport

[Legacy Jira: ASF subversion and git services on Sep 13 2019]

mikemccand commented 2 years ago

Closing after the 9.0.0 release

[Legacy Jira: Adrien Grand (@jpountz) on Dec 08 2021]

mikemccand / stargazers-migration-test

KoreanTokenizer should split unknown words on digits [LUCENE-8966] #963

Legacy Jira details