sing1ee / analyzer-solr

Analyzer adapter for Solr 5. We support Jieba, and will support Stanford in the future.
MIT License
61 stars 27 forks

Solr having problems with highlighting when using Jieba analyzer #2

Closed edwinyeozl closed 8 years ago

edwinyeozl commented 9 years ago

I'm using the Jieba analyzer to index Chinese characters in Solr. The segmentation works fine when I check it with the Analysis screen in the Solr Admin UI.

However, when I try highlighting in Solr, the highlight does not land in the correct place. For example, when I search for 自然环境与企业本身, it highlights 认为自然环与企业本身的. Even when I search for the English word responsibility, it highlights responsibility.

I'm using jieba-analysis-1.0.0, Solr 5.2.1 and Lucene 5.1.0
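For context on the symptom above: Solr's highlighters place the markup using the start and end character offsets recorded for each matched token, so a highlight that lands a character or two off is consistent with offsets that do not line up with the original text. The following is only a minimal sketch of that effect, not Solr's or Jieba's code; the class name, the `highlight` helper, and the sample sentence are all hypothetical.

```java
class OffsetHighlightSketch {
    /** Wraps text[start, end) in <em> tags, the way a highlighter uses token offsets. */
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<em>" + text.substring(start, end) + "</em>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        // Made-up sample text; the matched token is 自然环境.
        String text = "认为自然环境与企业本身的";
        String token = "自然环境";

        // Offsets that match the token's real position in the text.
        int start = text.indexOf(token);
        int end = start + token.length();
        System.out.println(highlight(text, start, end));

        // Same token length, but offsets shifted by one character
        // (an assumed analyzer bug, for illustration only).
        System.out.println(highlight(text, start + 1, end + 1));
    }
}
```

With correct offsets the markup wraps 自然环境; with the shifted offsets it wraps 然环境与, even though the search itself matched the right token.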

Regards, Edwin

sing1ee commented 8 years ago

Sorry, it's a little late. I have been too busy. Is the problem solved?

edwinyeozl commented 8 years ago

Hi Zhang Cheng,

Thank you for your reply.

Not yet, I'm still having a problem with highlighting for the content field. For other fields, the highlighting works fine. I've upgraded to Solr 5.3.0, and the same problem persists as in Solr 5.2.1. I'm still using jieba-analysis-1.0.0 for both versions.

I got these highlighting results for the following query: http://localhost:8983/solr/chinese3/highlight?q=乒乓球

    "highlighting":{
      "chinese3test1_chinese2乒乓球":{
        "id":["chinese3test1_chinese2乒乓球"],
        "title":["chinese2乒乓球"],
        "content":[" <p><br> 乒乓球,是一种世界流行的球类体育项目,也是 中 华 人民共和国 国球 。乒乓球运动是一项以技巧性为主,身体体能素质为辅的技能型项目,起源于英国。“乒乓球”一名 起源 于1900年,因其打击时发出“ping pang”的声音而得名,在中国大陆、香港及澳门等地区以“乒乓球”作为它的官方名称。 <br>乒乓球为圆球状,2000年 悉尼奥运会 之前(包括悉尼奥运会)国际比赛用球的直径"]}}}

Below is my configuration: <fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">

Regards, Edwin

On 15 October 2015 at 10:49, Cheng Zhang notifications@github.com wrote:

Sorry, it's a little late. I have been too busy. Is the problem solved?


edwinyeozl commented 8 years ago

I've made some minor modifications to the code in JiebaSegmenter.java, and the highlighting seems to be fine now.

Basically, I declared another int called offset2 in the process() method:

    int offset2 = 0;

Then I changed offset to offset2 in this part of the code in the process() method:

    if (sb.length() > 0)
        if (mode == SegMode.SEARCH) {
            for (Word token : sentenceProcess(sb.toString())) {
                // tokens.add(new SegToken(token, offset, offset += token.length()));
                tokens.add(new SegToken(token, offset2, offset2 += token.length())); // Change to offset2 by Edwin
            }
        } else {
            for (Word token : sentenceProcess(sb.toString())) {
                if (token.length() > 2) {
                    Word gram2;
                    int j = 0;
                    for (; j < token.length() - 1; ++j) {
                        gram2 = token.subSequence(j, j + 2);
                        if (wordDict.containsWord(gram2.getToken()))
                            // tokens.add(new SegToken(gram2, offset + j, offset + j + 2));
                            tokens.add(new SegToken(gram2, offset2 + j, offset2 + j + 2)); // Change to offset2 by Edwin
                    }
                }
                if (token.length() > 3) {
                    Word gram3;
                    int j = 0;
                    for (; j < token.length() - 2; ++j) {
                        gram3 = token.subSequence(j, j + 3);
                        if (wordDict.containsWord(gram3.getToken()))
                            // tokens.add(new SegToken(gram3, offset + j, offset + j + 3));
                            tokens.add(new SegToken(gram3, offset2 + j, offset2 + j + 3)); // Change to offset2 by Edwin
                    }
                }
                // tokens.add(new SegToken(token, offset, offset += token.length()));
                tokens.add(new SegToken(token, offset2, offset2 += token.length())); // Change to offset2 by Edwin
            }
        }
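The `new SegToken(token, offset2, offset2 += token.length())` call in the snippet above relies on Java evaluating method arguments left to right: the start argument reads the counter before the end argument advances it. Here is a small standalone sketch of that idiom. `Span` and `assignOffsets` are hypothetical stand-ins, not the jieba-analysis classes, and the sketch assumes the tokens are contiguous and the run starts at position 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RunningOffsetSketch {
    /** Hypothetical stand-in for a segmenter token: a word plus its [start, end) offsets. */
    static final class Span {
        final String word;
        final int start;
        final int end;

        Span(String word, int start, int end) {
            this.word = word;
            this.start = start;
            this.end = end;
        }
    }

    /** Assigns offsets to contiguous words by keeping a running counter. */
    static List<Span> assignOffsets(List<String> words) {
        List<Span> spans = new ArrayList<>();
        int offset2 = 0;
        for (String word : words) {
            // Arguments are evaluated left to right: the second argument reads the
            // old value of offset2, then the third advances it and passes the new value.
            spans.add(new Span(word, offset2, offset2 += word.length()));
        }
        return spans;
    }

    public static void main(String[] args) {
        for (Span s : assignOffsets(Arrays.asList("乒乓球", "运动"))) {
            System.out.println(s.word + " [" + s.start + ", " + s.end + ")");
        }
    }
}
```

Each span starts where the previous one ended, so the counter has to reflect the token's real position in the original text for the highlighter to mark the right characters.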

I'm not sure whether this is just a workaround or can be used as a permanent solution.

Regards, Edwin


sing1ee commented 8 years ago

@edwinyeozl thank you, you can commit your code directly!