medcl / elasticsearch-rtf

A Chinese distribution of Elasticsearch that bundles plugins for Chinese text, making it easy for beginners to learn and test.
Apache License 2.0

mmseg word segmentation problem #21

Closed isnolan closed 10 years ago

isnolan commented 10 years ago

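I added some custom words, such as "西红柿" (tomato), to elasticsearch-rtf/config/mmseg/words-my.dic, but the text is still split into single characters, as the highlight in this search result shows: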

{
   "took": 6,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.7263499,
      "hits": [
         {
            "_index": "index",
            "_type": "fulltext",
            "_id": "5",
            "_score": 0.7263499,
            "_source": {
               "content": "西红柿,番茄,鸡蛋,面条,西红柿鸡蛋面"
            },
            "highlight": {
               "content": [
                  "<tag1>西</tag1><tag2>红</tag2><tag1>柿</tag1>,番茄,<tag1>鸡蛋</tag1>,面条,<tag1>西</tag1><tag2>红</tag2><tag1>柿</tag1><tag1>鸡蛋</tag1>面"
               ]
            }
         }
      ]
   }
}

How should I handle this, or is there any relevant documentation? Thanks.

isnolan commented 10 years ago

The words.dic file already contains the word "西红柿", so it looks as if the dictionary is not being loaded. How can I fix this?

medcl commented 10 years ago

1. Make sure the dictionary file is UTF-8 encoded.
2. Use the complex mode:

curl -XPOST 'http://localhost:9200/index/_analyze?analyzer=mmseg_complex' -d '{ "text": "嘻嘻西红柿,真好吃,我也最好吃" }'
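For the first point, a minimal sketch of how the encoding could be checked before restarting the node. The helper name `check_dictionary` is hypothetical, and the sketch assumes the dictionary holds one word per line; it is not part of the plugin.

```python
from pathlib import Path


def check_dictionary(path, word):
    """Report whether a dictionary file decodes as UTF-8 and lists `word`.

    Assumption: the file holds one word per line.
    """
    raw = Path(path).read_bytes()
    report = {
        "utf8": True,
        "bom": raw.startswith(b"\xef\xbb\xbf"),  # a UTF-8 byte order mark
        "has_word": False,
    }
    try:
        # "utf-8-sig" decodes UTF-8 and drops a leading BOM if present.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Typical for files saved as GBK/GB2312 by some editors.
        report["utf8"] = False
        return report
    report["has_word"] = word in {line.strip() for line in text.splitlines()}
    return report
```

Calling `check_dictionary("config/mmseg/words-my.dic", "西红柿")` from the elasticsearch-rtf directory would return `utf8: False` for a file saved in a legacy encoding such as GBK, which is one way a custom word could end up being ignored.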

isnolan commented 10 years ago

Oh, it works once I use mmseg_complex. Thanks.

ng-wei commented 9 years ago

What is the difference between mmseg_complex and mmseg_simple in terms of implementation? @medcl

medcl commented 9 years ago

@ng-wei The complex mode uses more sophisticated segmentation logic and performs some disambiguation.
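To make the difference concrete, here is a rough sketch based on the published MMSEG algorithm, not on the plugin's actual source. It assumes the simple mode is forward maximum matching and the complex mode compares chunks of up to three words using the first three MMSEG rules (maximum total length, largest average word length, smallest variance of word lengths); the fourth rule, which relies on character frequencies, is left out. All function names and the tiny dictionary are hypothetical.

```python
from statistics import pvariance


def candidates(text, pos, words):
    """Dictionary words starting at pos; a single character is always allowed."""
    found = [text[pos:pos + 1]]
    max_len = max((len(w) for w in words), default=1)
    for end in range(pos + 2, min(len(text), pos + max_len) + 1):
        if text[pos:end] in words:
            found.append(text[pos:end])
    return found


def segment_simple(text, words):
    """Simple mode (assumed): always take the longest word at the current position."""
    out, pos = [], 0
    while pos < len(text):
        word = max(candidates(text, pos, words), key=len)
        out.append(word)
        pos += len(word)
    return out


def chunks(text, pos, words, depth=3):
    """All ways to read up to `depth` consecutive words starting at pos."""
    if depth == 0 or pos >= len(text):
        return [[]]
    result = []
    for w in candidates(text, pos, words):
        for rest in chunks(text, pos + len(w), words, depth - 1):
            result.append([w] + rest)
    return result


def segment_complex(text, words):
    """Complex mode (assumed): pick the best three-word chunk, keep its first word."""
    out, pos = [], 0
    while pos < len(text):
        best = max(
            chunks(text, pos, words),
            key=lambda c: (
                sum(map(len, c)),                 # rule 1: maximum total length
                sum(map(len, c)) / len(c),        # rule 2: largest average word length
                -pvariance([len(w) for w in c]),  # rule 3: smallest length variance
            ),
        )
        out.append(best[0])
        pos += len(best[0])
    return out


WORDS = {"研究", "研究生", "生命", "命", "起源"}  # toy dictionary for illustration
print(segment_simple("研究生命起源", WORDS))   # ['研究生', '命', '起源']
print(segment_complex("研究生命起源", WORDS))  # ['研究', '生命', '起源']
```

In this toy example the simple mode greedily takes "研究生" and is left with a stray "命", while the complex mode looks ahead and prefers the evenly sized reading "研究 / 生命 / 起源".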