hankcs / hanlp-lucene-plugin

HanLP Chinese word segmentation plugin for Lucene, supporting Lucene-based systems including Solr
http://www.hankcs.com/nlp/segment/full-text-retrieval-solr-integrated-hanlp-chinese-word-segmentation.html
Apache License 2.0
296 stars 99 forks

Add a pinyin token filter #47

Closed canghailan closed 5 years ago

canghailan commented 5 years ago

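I wanted pinyin search in Lucene without pulling in any other dependencies, so I added a pinyin token filter. It supports searching by full pinyin and by initials (the first letter of each syllable).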
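
To make "full pinyin" and "initials" concrete, here is a minimal sketch of the two search keys that could be derived from a term. It is not the filter's actual implementation: the class name `PinyinKeysSketch` and its four-entry lookup table are hypothetical stand-ins for a real pinyin dictionary such as the one HanLP ships.

```java
import java.util.Map;

// Toy illustration only: a four-entry table stands in for a real pinyin dictionary.
final class PinyinKeysSketch {
    // Hypothetical character-to-toneless-pinyin table; this is not HanLP's data.
    private static final Map<Character, String> TABLE = Map.of(
            '\u4e2d', "zhong", // U+4E2D
            '\u6587', "wen",   // U+6587
            '\u62fc', "pin",   // U+62FC
            '\u97f3', "yin");  // U+97F3

    // Concatenates the toneless pinyin of every character (full-pinyin key).
    static String fullPinyin(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            // Characters missing from the table are passed through unchanged.
            sb.append(TABLE.getOrDefault(c, String.valueOf(c)));
        }
        return sb.toString();
    }

    // Keeps only the first letter of each character's pinyin (initials key).
    static String initials(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            String py = TABLE.get(c);
            sb.append(py != null ? py.charAt(0) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String term = "\u4e2d\u6587"; // the two characters U+4E2D U+6587
        System.out.println(fullPinyin(term)); // prints zhongwen
        System.out.println(initials(term));   // prints zw
    }
}
```

Under this sketch, a query for either `zhongwen` or `zw` would line up with a key derived from the same two-character term, which is the behaviour the filter is meant to provide.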

Cydmi commented 5 years ago

Why can't I get any search results after configuring this in Solr? How should managed-schema be configured? @canghailan

hankcs commented 5 years ago

Thanks for the contribution!

canghailan commented 5 years ago

So far I have only used it directly in Lucene. I'll try it in Solr later.

// Tokenize with HanLP, lowercase the tokens, then apply the pinyin filter.
Analyzer analyzer = CustomAnalyzer.builder()
    .withTokenizer(HanLPTokenizerFactory.class)
    .addTokenFilter(LowerCaseFilterFactory.class)
    .addTokenFilter(HanLPPinyinTokenFilterFactory.class)
    .build();

@Cydmi

canghailan commented 5 years ago

Why can't I get any search results after configuring this in Solr? How should managed-schema be configured? @canghailan

My environment is macOS with Solr 7.7.1.

  1. Pull the latest hanlp-lucene-plugin code, rebuild it, and copy

    hanlp-lucene-plugin.jar
    hanlp-portable-1.6.8.jar

    into

    solr-7.7.1/server/solr-webapp/webapp/WEB-INF/lib

  2. I started from the example configuration and modified it. This configuration works:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory"/>
        <filter class="com.hankcs.lucene.HanLPPinyinTokenFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory"/>
        <filter class="com.hankcs.lucene.HanLPPinyinTokenFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  3. The test result is as follows:

    
    http://localhost:8983/solr/test/select?q=content_t%3Azw

    {
      "responseHeader": {
        "status": 0,
        "QTime": 1,
        "params": { "q": "content_t:zw" }
      },
      "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
          { "id": "2", "content_t": "中文", "version": 1633607220462616600 },
          { "id": "3", "content_t": "中文拼音", "version": 1633607227204960300 }
        ]
      }
    }
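
The request URL in the test percent-encodes the colon in `content_t:zw` as `%3A`. For anyone issuing the same query from code, a minimal sketch of building that URL with the JDK alone might look like this. The class name `SolrQueryUrlSketch` and the helper `selectUrl` are hypothetical, and the base URL assumes the same local core named `test` as in the thread.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustration only: builds a Solr select URL by percent-encoding the q parameter.
final class SolrQueryUrlSketch {
    // coreBaseUrl is assumed to look like http://localhost:8983/solr/<core>.
    static String selectUrl(String coreBaseUrl, String query) {
        // URLEncoder turns ':' into %3A and leaves letters, digits and '_' unchanged.
        return coreBaseUrl + "/select?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints http://localhost:8983/solr/test/select?q=content_t%3Azw
        System.out.println(selectUrl("http://localhost:8983/solr/test", "content_t:zw"));
    }
}
```

`URLEncoder.encode(String, Charset)` requires Java 10 or later. On older JDKs the `encode(String, String)` overload with `"UTF-8"` does the same job.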



@Cydmi

Cydmi commented 5 years ago

@canghailan Thank you