huaban / jieba-analysis

结巴分词(java版)
https://github.com/huaban/jieba-analysis
Apache License 2.0

Problems with highlighting in Solr when using the Jieba analyzer #16

Open edwinyeozl opened 9 years ago

edwinyeozl commented 9 years ago

Hi, I'm using the Jieba analyzer to index Chinese characters in Solr. The segmentation works fine when using the Analysis screen in the Solr Admin UI.

However, when I try to do highlighting in Solr, it does not highlight in the correct place. For example, when I search for 自然环境与企业本身, it highlights 认<em>为自然环</em><em>境</em><em>与企</em><em>业本</em>身的

Even when I search for the English word responsibility, it highlights <em> responsibilit</em>y.

Basically, the highlighting is consistently off by one character/space. Does anyone know what could be the issue?
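
For reference, here is a minimal sketch of how to inspect the offsets the segmenter produces, using the public JiebaSegmenter API directly, outside of Solr. It assumes the public startOffset/endOffset fields on SegToken, and simply prints each token next to the characters its offsets actually point at in the original string, which is what the highlighter relies on:

    import java.util.List;

    import com.huaban.analysis.jieba.JiebaSegmenter;
    import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
    import com.huaban.analysis.jieba.SegToken;

    public class OffsetDump {
        public static void main(String[] args) {
            JiebaSegmenter segmenter = new JiebaSegmenter();
            String text = "认为自然环境与企业本身的";
            List<SegToken> tokens = segmenter.process(text, SegMode.SEARCH);
            for (SegToken t : tokens) {
                // Clamp the offsets so the dump keeps running even if they drift
                // past the end of the string.
                int start = Math.min(t.startOffset, text.length());
                int end = Math.min(t.endOffset, text.length());
                // If the token and the substring at its offsets disagree, the
                // highlighter will place its <em> tags in the wrong position.
                System.out.println(t + " -> \"" + text.substring(start, end) + "\"");
            }
        }
    }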

I'm using jieba-analysis-1.0.0, Solr 5.2.1, and Lucene 5.1.0.

Regards, Edwin

jdkcn commented 9 years ago

Hi, Edwin

I use jieba-analysis-1.0.0, Solr 5.1.0, and Lucene 5.1.0, and have no such problem.

Please see this query:

http://openlaw.cn/search/judgement/default?type=&typeValue=&courtId=&lawFirmId=&docType=&causeId=&judgeDateYear=&lawSearch=&litigationType=&judgeId=&zoneId=&procedureType=&keyword=%E6%8D%9F%E5%A4%B1%E9%87%91%E9%A2%9D+%E8%AE%A1%E7%AE%97%E6%98%AF%E5%90%A6%E5%87%86%E7%A1%AE

I use this project for Solr integration:

https://github.com/sing1ee/analyzer-solr

Regards,

Rory

edwinyeozl commented 9 years ago

Hi Rory,

Thank you for your reply.

I tried it on Solr 5.1.0 and got this result, which is also incorrect:

http://localhost:8983/solr/chinese3/highlight?q=乒乓球

"highlighting":{ "chinese3_chinese2乒乓球":{ "id":["chinese3_chinese2乒乓球"], "title":["chinese2乒乓球"], "content":[" \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 乒乓球,是一种世界流行的球类体育项目,也是 中 华 人民共和国 国球 。乒乓球运动是一项以技"]}}}

Below is my configuration:

    <fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">

However, I found that all the other fields actually work fine; it's only the content field that has the issue. The same goes for Solr 5.3.1: only the content field has misaligned tags, while the tags in the other fields are at the correct positions.

Regards, Edwin


edwinyeozl commented 9 years ago

Hi Rory,

I got the following result in Solr 5.3.0 by using the same query and configuration:

"highlighting":{ "chinese3test1_chinese2乒乓球":{ "id":["chinese3test1_chinese2乒乓球"], "title":["chinese2乒乓球"], "content":[" <p><br> 乒乓球,是一种世界流行的球类体育项目,也是 中 华 人民共和国 国球 。乒乓球运动是一项以技巧性为主,身体体能素质为辅的技能型项目,起源于英国。“乒乓球”一名 起源 于1900年,因其打击时发出“ping pang”的声音而得名,在中国大陆、香港及澳门等地区以“乒乓球”作为它的官方名称。 <br>乒乓球为圆球状,2000年 悉尼奥运会 之前(包括悉尼奥运会)国际比赛用球的直径"]}}}

Regards, Edwin


edwinyeozl commented 8 years ago

Hi Rory,

I found that for English words, the JiebaTokenizerFactory cuts the words in the wrong places. For example, it cuts the word "water" as follows:

w|at|er

This means that Solr will search for three separate tokens, "w", "at" and "er", instead of the whole word "water".

Is there any way to solve this problem, besides using separate fields for English and Chinese text?
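
One direction I can think of, sketched below purely as an illustration (it is not taken from this project or from analyzer-solr, and the example text just reuses words from this thread), is to pre-split the input into CJK runs and Latin/digit runs, send only the CJK runs through Jieba, and keep each Latin run as a single token:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import com.huaban.analysis.jieba.JiebaSegmenter;
    import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
    import com.huaban.analysis.jieba.SegToken;

    public class MixedTextDemo {
        public static void main(String[] args) {
            String text = "自然环境与企业本身的water responsibility";
            JiebaSegmenter segmenter = new JiebaSegmenter();

            // Match either a run of CJK ideographs or a run of Latin letters/digits.
            Matcher m = Pattern.compile("[\\u4E00-\\u9FFF]+|[A-Za-z0-9]+").matcher(text);
            List<String> terms = new ArrayList<String>();
            while (m.find()) {
                String run = m.group();
                if (run.codePointAt(0) >= 0x4E00) {
                    // CJK run: let Jieba segment it.
                    for (SegToken t : segmenter.process(run, SegMode.SEARCH)) {
                        terms.add(t.toString());
                    }
                } else {
                    // Latin/digit run: keep the whole word, e.g. "water".
                    terms.add(run);
                }
            }
            System.out.println(terms);
        }
    }

Inside Solr this logic would have to live in a custom Tokenizer rather than a standalone class, with each run's start position added back onto Jieba's token offsets so the highlighter still lines up with the stored text.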

Here's my configuration in schema.xml for the JiebaTokenizerFactory:

    <fieldType name="text_chinese2" class="solr.TextField" positionIncrementGap="100">

    <field name="content" type="text_chinese2" indexed="true" stored="true" omitNorms="true" termVectors="true"/>

Regards, Edwin


edwinyeozl commented 8 years ago

I've made a minor modification to the code in JiebaSegmenter.java, and the highlighting seems to be fine now.

Basically, I created another int called offset2 in the process() method:

    int offset2 = 0;

Then I changed offset to offset2 in this part of the process() method:

    if (sb.length() > 0)
        if (mode == SegMode.SEARCH) {
            for (Word token : sentenceProcess(sb.toString())) {
                // tokens.add(new SegToken(token, offset, offset += token.length()));
                tokens.add(new SegToken(token, offset2, offset2 += token.length())); // Change to offset2 by Edwin
            }
        } else {
            for (Word token : sentenceProcess(sb.toString())) {
                if (token.length() > 2) {
                    Word gram2;
                    int j = 0;
                    for (; j < token.length() - 1; ++j) {
                        gram2 = token.subSequence(j, j + 2);
                        if (wordDict.containsWord(gram2.getToken()))
                            // tokens.add(new SegToken(gram2, offset + j, offset + j + 2));
                            tokens.add(new SegToken(gram2, offset2 + j, offset2 + j + 2)); // Change to offset2 by Edwin
                    }
                }
                if (token.length() > 3) {
                    Word gram3;
                    int j = 0;
                    for (; j < token.length() - 2; ++j) {
                        gram3 = token.subSequence(j, j + 3);
                        if (wordDict.containsWord(gram3.getToken()))
                            // tokens.add(new SegToken(gram3, offset + j, offset + j + 3));
                            tokens.add(new SegToken(gram3, offset2 + j, offset2 + j + 3)); // Change to offset2 by Edwin
                    }
                }
                // tokens.add(new SegToken(token, offset, offset += token.length()));
                tokens.add(new SegToken(token, offset2, offset2 += token.length())); // Change to offset2 by Edwin
            }
        }

I'm not sure if this is just a workaround or can be used as a permanent solution.
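
To sanity-check a change like this, a small sketch like the following may help. It assumes the public startOffset/endOffset fields on SegToken, runs both segmentation modes over a sentence containing punctuation (so process() handles more than one sentence buffer), and flags any token whose offsets fall outside the original string, which is one easy-to-spot symptom of drifting offsets:

    import java.util.List;

    import com.huaban.analysis.jieba.JiebaSegmenter;
    import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
    import com.huaban.analysis.jieba.SegToken;

    public class OffsetSanityCheck {
        public static void main(String[] args) {
            JiebaSegmenter segmenter = new JiebaSegmenter();
            String text = "乒乓球,是一种世界流行的球类体育项目,也是中华人民共和国国球。";
            for (SegMode mode : SegMode.values()) {
                List<SegToken> tokens = segmenter.process(text, mode);
                for (SegToken t : tokens) {
                    boolean inBounds = t.startOffset >= 0
                            && t.startOffset <= t.endOffset
                            && t.endOffset <= text.length();
                    // With drifting offsets, tokens late in the text can end up
                    // pointing past the end of the string; with correct offsets
                    // they never do.
                    System.out.println(mode + " " + t + (inBounds ? "" : "  <-- out of bounds"));
                }
            }
        }
    }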

Regards, Edwin


finch0001 commented 8 years ago

I think you need to do this: in JiebaAdapter.java, in public synchronized void reset(Reader input), change

    raw = bdr.toString().trim();

to

    raw = bdr.toString();

Trimming strips the leading whitespace from the raw text, so the token offsets no longer line up with the stored field value that the highlighter applies them to, which would explain the misaligned tags in the content field.