Text that starts with '\n' causes a highlight misalignment bug

hankcs / hanlp-lucene-plugin

HanLP Chinese word segmentation plugin for Lucene, supporting Lucene-based systems including Solr

Apache License 2.0


Closed: AnyListen closed this issue 6 years ago

AnyListen commented 6 years ago

In the HighLighterTest.java test file, prepending a '\n' to the indexed text is enough to reproduce the bug.

My test text:

String text2 = "\n朗坤智能云平台—LiCP\nLuculent intelligent/industrial Cloud Platform\n白皮书\n\n——跨行业跨领域工业互联网平台 ";

String keyword = "朗坤智能云平台";

AnyListen commented 6 years ago

Test result:

【
】【朗】【坤智】【能】【云平】台—LiCP
Luculent intelligent/industrial Cloud Platform
白皮书

——跨行业跨领域工业互联【网平】台
测试回测换行符 , 0.47491124

AnyListen commented 6 years ago

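Looking more closely, the problem appears to be the Scanner scanner in SegmentWrapper: scanner.next() skips the leading \n by default, which would explain why every highlight in the result above is shifted by one character.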
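The Scanner behaviour described above can be reproduced outside the plugin. The following is only a minimal sketch, assuming the Scanner is configured with "\n" as its delimiter; the class name ScannerLeadingNewlineDemo and the helper firstToken are hypothetical and are not part of SegmentWrapper.

```java
import java.util.Scanner;

// Hypothetical demo class, not part of hanlp-lucene-plugin.
public class ScannerLeadingNewlineDemo {

    /** Returns the first token a Scanner yields when "\n" is the delimiter (assumed setup). */
    public static String firstToken(String text) {
        try (Scanner scanner = new Scanner(text).useDelimiter("\n")) {
            return scanner.hasNext() ? scanner.next() : null;
        }
    }

    public static void main(String[] args) {
        String text = "\nabc\ndef";
        String token = firstToken(text);
        // Scanner silently drops the leading delimiter, so a consumer that assumes
        // the first token starts at offset 0 is off by one character.
        System.out.println("token=" + token + ", true offset=" + text.indexOf(token));
        // prints "token=abc, true offset=1"
    }
}
```

Because the skipped character is never reported to the caller, any offsets computed from the returned tokens drift from their positions in the original text.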

hankcs commented 6 years ago

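Yes. Java's Scanner and BufferedReader both do extra, unwanted work when handling line breaks, so we have to write our own Reader. Please test the patch I just committed; if it works, I'll release a new version.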
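As a rough illustration of what "writing our own Reader" could look like, here is a sketch that reads line-sized chunks while keeping every '\n', so the chunk lengths always add up to the original offsets. It is not the committed patch; the class NewlinePreservingReader and its methods are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch, not the actual patch in hanlp-lucene-plugin.
public class NewlinePreservingReader {
    private final Reader input;
    private int offset = 0; // number of characters handed out so far

    public NewlinePreservingReader(Reader input) {
        this.input = input;
    }

    /** Returns the next chunk including its trailing '\n', or null at end of input. */
    public String nextChunk() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                sb.append((char) c);
                if (c == '\n') break; // keep the newline instead of discarding it
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (sb.length() == 0) return null;
        offset += sb.length();
        return sb.toString();
    }

    /** Offset in the original text at which the next chunk will start. */
    public int offset() {
        return offset;
    }

    public static void main(String[] args) {
        NewlinePreservingReader r = new NewlinePreservingReader(new StringReader("\nabc\ndef"));
        int start = r.offset();
        String chunk;
        while ((chunk = r.nextChunk()) != null) {
            System.out.println(start + ": " + chunk.replace("\n", "\\n"));
            start = r.offset();
        }
    }
}
```

Since no character is dropped, a leading '\n' becomes a chunk of its own at offset 0 and the text after it starts at offset 1, matching its position in the source string.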

AnyListen commented 6 years ago

I've integrated it into the ES word segmentation plugin and tested it; no problems now.