Some Traditional Chinese sentences cannot be analyzed in Solr

benchuang11046 commented 8 years ago

Hello. When I analyze certain Traditional Chinese sentences, an error occurs, as shown below.

My schema.xml is as follows:

<fieldType name="text_zh" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory"
                   enableIndexMode="true"
                   enableTraditionalChineseMode="true"
                   enableOrganizationRecognize="true"
                   enablePlaceRecognize="true"
                   customDictionaryPath="/root/hanlp/hanlp_custom_dic.txt /root/hanlp/hanlp_modern_dic.txt /root/hanlp/hanlp_people.txt"
                   enableNormalization="false"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
    </analyzer>
</fieldType>

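With normalization enabled, some sentences also fail to analyze, as shown below.

Is there a recommended configuration for handling these Traditional Chinese analysis failures? Thanks.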

hankcs commented 8 years ago

Thanks for the feedback. Could you paste the sentences that trigger the problem so that I can look into it?

benchuang11046 commented 8 years ago

With normalization enabled, the sentences that trigger the problem are:

吵架吵到快取消結婚了

逢甲夜市跟士林夜市有賣馬汀鞋嗎

遼寧夜市、景美夜市、臨江夜市  哪一個夜市好吃的東西較多 又好逛？我要從西門站...

With normalization disabled, the sentences that trigger the problem are:

【婚禮-喜帖謝卡】我們的喜帖謝卡|奢華大亨小傳簡約蒂凡內早餐|自製明信片謝卡|

進退兩難! 大貨車一頭鑽進機車地下道沒注意限高 大貨車卡機車道壓碎標線不熟路況闖機車道 貨車駕駛依法開罰

Thanks.

hankcs commented 8 years ago

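Thanks for the feedback; this has been fixed.

The Simplified/Traditional segmentation incorporates a patch from Tony-Wang, which we later adjusted.
Since I am not a Traditional Chinese user, I did not test it thoroughly. Apologies for the trouble this caused.
It should be fixed now; please give it a try.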

benchuang11046 commented 8 years ago

Hello, thanks for committing the fix. Using a build from the new commit, I found a few sentences that still cannot be analyzed:

會辦台星保證最低價的原因？
台灣之星收訊真的很好
台灣之星易付卡
排球---史上頭一遭！
關於黑貓直送的重量限制

My schema.xml is the same as before (with normalization enabled). Could you help verify this? Thanks.

hankcs commented 8 years ago

I could not reproduce this problem.

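    import java.io.StringReader;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    import com.hankcs.lucene.HanLPTokenizerFactory;

    public class TraditionalChineseAnalyzeTest {
        public static void main(String[] argv) throws Exception {
            // Same switches as the index-time analyzer that fails in Solr.
            Map<String, String> args = new TreeMap<>();
            args.put("enableTraditionalChineseMode", "true");
            args.put("enableNormalization", "true");
            HanLPTokenizerFactory factory = new HanLPTokenizerFactory(args);
            Tokenizer tokenizer = factory.create();
            String text = "會辦台星保證最低價的原因？";

            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                CharTermAttribute attribute = tokenizer.getAttribute(CharTermAttribute.class);
                // Start and end offsets of the token in the input text
                OffsetAttribute offsetAtt = tokenizer.getAttribute(OffsetAttribute.class);
                // Position increment relative to the previous token
                PositionIncrementAttribute positionAttr = tokenizer.getAttribute(PositionIncrementAttribute.class);
                // Part-of-speech tag
                TypeAttribute typeAttr = tokenizer.getAttribute(TypeAttribute.class);
                System.out.printf("[%d:%d %d] %s/%s\n", offsetAtt.startOffset(), offsetAtt.endOffset(), positionAttr.getPositionIncrement(), attribute, typeAttr.type());
            }
            tokenizer.end();
            tokenizer.close();
        }
    }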

[0:2 1] 会办/nz
[2:3 1] 台/q
[3:4 1] 星/n
[4:6 1] 保证/v
[6:9 1] 最低价/n
[9:10 1] 的/uj
[10:12 1] 原因/n
[12:13 1] 。/w
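
Each output line above has the form `[startOffset:endOffset positionIncrement] term/type`. As a rough sanity check (not part of the plugin, and only a sketch), the printed offsets can be parsed to confirm that the tokens cover the 13-character input from start to end with no gaps or overlaps. The class name `OffsetCheck` and the method `tilesInput` are hypothetical names chosen for illustration. The check assumes non-overlapping tokens, as in this output; with `enableIndexMode="true"` the tokenizer may emit overlapping sub-tokens, in which case this check would not apply.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: validates the "[start:end inc] term/type" lines printed by the test case.
public class OffsetCheck {
    private static final Pattern TOKEN_LINE =
            Pattern.compile("^\\[(\\d+):(\\d+) (\\d+)\\] (.*)/([^/]*)$");

    /** Returns true if the tokens cover [0, textLength) with no gap, overlap, or empty token. */
    public static boolean tilesInput(List<String> lines, int textLength) {
        int expectedStart = 0;
        for (String line : lines) {
            Matcher m = TOKEN_LINE.matcher(line);
            if (!m.matches()) {
                return false; // line is not in the expected format
            }
            int start = Integer.parseInt(m.group(1));
            int end = Integer.parseInt(m.group(2));
            if (start != expectedStart || end <= start) {
                return false; // gap, overlap, or empty token
            }
            expectedStart = end;
        }
        return expectedStart == textLength;
    }

    public static void main(String[] args) {
        String text = "會辦台星保證最低價的原因？";
        List<String> lines = Arrays.asList(
                "[0:2 1] 会办/nz",
                "[2:3 1] 台/q",
                "[3:4 1] 星/n",
                "[4:6 1] 保证/v",
                "[6:9 1] 最低价/n",
                "[9:10 1] 的/uj",
                "[10:12 1] 原因/n",
                "[12:13 1] 。/w");
        System.out.println(tilesInput(lines, text.length()));
    }
}
```

For the output shown above, the last token ends at offset 13, which matches the length of the input string.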

hankcs commented 8 years ago

The last commit went into HanLP itself. Please rebuild HanLP and then test with the test case above.

benchuang11046 commented 8 years ago

It seems to have been a Solr issue. After restarting Solr, the sentences can be analyzed. Thanks.

hankcs / hanlp-lucene-plugin

Some Traditional Chinese sentences cannot be analyzed in Solr #5