Lucene++ is an up-to-date C++ port of the popular Java Lucene library, a high-performance, full-featured text search engine.

Fix a bug in ChineseTokenizer #160

Closed Kakueeen closed 3 years ago

Kakueeen commented 3 years ago

Description: When I use ChineseAnalyzer for Chinese word segmentation, I find that English letters and numbers are treated as one word, and I think they should be separated.

Root cause: Null

Solution:
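The intended split can be sketched with a small standalone helper (a hypothetical function for illustration, not part of Lucene++): alphabetic runs and digit runs become separate tokens instead of one combined token.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not the Lucene++ ChineseTokenizer): split an ASCII
// string into separate alphabetic and numeric runs, so "abc123" yields
// {"abc", "123"} instead of one combined token.
std::vector<std::string> splitLettersAndDigits(const std::string& input) {
    std::vector<std::string> tokens;
    std::string current;
    int currentClass = -1;  // 0 = letter run, 1 = digit run, -1 = separator
    for (unsigned char ch : input) {
        int cls = std::isalpha(ch) ? 0 : (std::isdigit(ch) ? 1 : -1);
        if (cls != currentClass && !current.empty()) {
            tokens.push_back(current);  // character class changed: close the run
            current.clear();
        }
        currentClass = cls;
        if (cls != -1)
            current.push_back(static_cast<char>(ch));
    }
    if (!current.empty())
        tokens.push_back(current);      // flush the trailing run
    return tokens;
}
```

With this behavior, `splitLettersAndDigits("abc123")` returns `{"abc", "123"}`, which matches the separation the description asks for.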

LocutusOfBorg commented 3 years ago

@alanw this is breaking the test suite...

[----------] 5 tests from ChineseTokenizerTest
[ RUN      ] ChineseTokenizerTest.testOtherLetterOffset
[       OK ] ChineseTokenizerTest.testOtherLetterOffset (0 ms)
[ RUN      ] ChineseTokenizerTest.testReusableTokenStream1
[       OK ] ChineseTokenizerTest.testReusableTokenStream1 (0 ms)
[ RUN      ] ChineseTokenizerTest.testReusableTokenStream2
[       OK ] ChineseTokenizerTest.testReusableTokenStream2 (1 ms)
[ RUN      ] ChineseTokenizerTest.testNumerics
/<<PKGBUILDDIR>>/src/test/analysis/BaseTokenStreamFixture.cpp:127: Failure
Value of: !ts->incrementToken()
  Actual: false
Expected: true
[  FAILED  ] ChineseTokenizerTest.testNumerics (0 ms)
[ RUN      ] ChineseTokenizerTest.testEnglish
[       OK ] ChineseTokenizerTest.testEnglish (0 ms)
[----------] 5 tests from ChineseTokenizerTest (1 ms total)

LocutusOfBorg commented 3 years ago

@Kakueeen ^^

Kakueeen commented 3 years ago

I know why this case failed. Before this change, incrementToken would return false when the content was pure numbers; the tokenizer now supports pure numbers, so a token is produced. Do I need to modify the unit test?

Kakueeen commented 3 years ago

@LocutusOfBorg @alanw

LocutusOfBorg commented 3 years ago

Hello, if you have a patch, please submit it. For now I had to upload to Debian and Ubuntu without this pull request, because it breaks the tests...