infinilabs / analysis-ik

🚌 The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and OpenSearch and supports customized dictionaries.
Apache License 2.0

A match_phrase problem similar to #195 #204

Closed smilesfc closed 7 years ago

smilesfc commented 8 years ago

medcl, I think there may be a problem with the positions that the ik analyzer assigns to tokens.

First, the problem: the original text contains "前次募集资金" and the field is indexed with ik_max_word. At search time, a match_phrase query for "前次募集资金" works, but a match_phrase query for "前次募集" finds nothing.

Testing the ik_max_word analyzer: _analyze?analyzer=ik_max_word&text=前次募集资金 returns: {"tokens":[{"token":"前次","start_offset":0,"end_offset":2,"type":"CN_WORD","position":0},{"token":"募集","start_offset":2,"end_offset":4,"type":"CN_WORD","position":1},{"token":"募","start_offset":2,"end_offset":3,"type":"CN_WORD","position":2},{"token":"集","start_offset":3,"end_offset":4,"type":"CN_CHAR","position":3},{"token":"基金","start_offset":4,"end_offset":6,"type":"CN_WORD","position":4}]}

Relevant mapping: [ElasticProperty(IncludeInAll = false, IndexAnalyzer = "ik_max_word", SearchAnalyzer = "ik_max_word")] public string Title { get; set; }

First search, using "前次募集资金":

{
  "_source": "false",
  "highlight": {
    "fields": {
      "title": {}
    }
  },
  "query": {
    "match_phrase": {
      "title": {
        "query": "前次募集资金"
      }
    }
  }
}

This returns results: {"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":21,"max_score":3.6136353,"hits":[{"_index":"disclosure.main.alpha","_type":"esdisclosurecomp","_id":"73419","_score":3.6136353,"highlight":{"title":["国金证券:<em>前次募集资金</em>使用情况报告"]}}]}}

Then the second search, with "资金" removed:

{
  "_source": "false",
  "highlight": {
    "fields": {
      "title": {}
    }
  },
  "size": 1,
  "query": {
    "match_phrase": {
      "title": {
        "query": "前次募集"
      }
    }
  }
}

This time nothing matches: {"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

That puzzled me, so I enabled term vectors and looked up the 国金证券 document, where I found:

"前次" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 5,
            "start_offset" : 5,
            "end_offset" : 7
          } ]
        },
"募集" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 7,
            "start_offset" : 7,
            "end_offset" : 9
          } ]
        },
"募集资金" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 6,
            "start_offset" : 7,
            "end_offset" : 11
          } ]
        },

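I think this is where the problem lies. Under ik_max_word, 募集资金 is itself a complete word, and it can also be split into 募集 and 资金. The position of 募集 (7) is therefore no longer adjacent to that of 前次 (5), so match_phrase does not treat 前次募集 as a phrase. Could you check whether that is what is happening, and is there a way to work around it? Thanks!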
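The reasoning above can be checked with a small model of exact phrase matching (slop 0), where every query term must appear in the document at the same relative position it has in the query. This is only a sketch, not Lucene's actual implementation: `phrase_matches` is a hypothetical helper, the document positions are copied from the term vectors above, and the query-side tokenizations are assumptions, since the thread does not show them.

```python
from typing import Dict, List, Tuple


def phrase_matches(doc_positions: Dict[str, List[int]],
                   query_terms: List[Tuple[str, int]]) -> bool:
    """Simplified exact-phrase check (slop 0).

    doc_positions maps each indexed term to its positions in the document.
    query_terms lists (term, position) pairs produced by the search analyzer.
    """
    first_term, first_qpos = query_terms[0]
    for start in doc_positions.get(first_term, []):
        base = start - first_qpos  # where query position 0 would fall in the document
        if all(base + qpos in doc_positions.get(term, [])
               for term, qpos in query_terms):
            return True
    return False


# Positions copied from the term vectors quoted above.
doc = {"前次": [5], "募集": [7], "募集资金": [6]}

# ASSUMPTION: how the search analyzer tokenizes each query (not shown in the thread).
full_query = [("前次", 0), ("募集资金", 1), ("募集", 2)]
short_query = [("前次", 0), ("募集", 1)]

print(phrase_matches(doc, full_query))   # True
print(phrase_matches(doc, short_query))  # False
```

Under these assumptions the longer query lines up with the indexed positions while the shorter one does not, which is consistent with the behaviour reported above.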

medcl commented 8 years ago

@smilesfc Please post a complete, reproducible REST script. I don't see the problem you describe in my tests, and it has nothing to do with position.

POST index/type3/_mapping { "properties": { "myname":{ "type": "string" , "analyzer": "ik_max_word" } } }

PUT index/type3/1 { "myname":"国金证券:前次募集资金使用情况报告" }

POST index/type3/_search { "_source": "false", "highlight": { "fields": { "title": {} } }, "query": { "match_phrase": { "myname": { "query": "募集资金" } } } }

{ "took": 6, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1, "hits": [ { "_index": "index", "_type": "type3", "_id": "1", "_score": 1 } ] } }

smilesfc commented 8 years ago

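The script you posted is enough to reproduce it, but "募集资金" has to be added to the custom dictionary myDict. Without that word there is no problem; it only appears after the word is added.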

medcl commented 8 years ago

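@smilesfc I added it to the dictionary and still don't see the problem. Please post the complete reproduction steps in the same format as mine above, using Sense.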

wuyadong commented 8 years ago

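@medcl I've run into a similar problem. ES versions: 2.3.1 and 2.3.3, using ik_max_word. I tested several sets of data; here is a summary:

  1. It is unrelated to filters.
  2. It only occurs with phrase queries.
  3. It only affects the longest word in the dictionary. For example, "北京宝软科技有限公司" and "宝软科技有限公司" are both in the dictionary; searching for "宝软科技有限公司" works, but searching for "北京宝软科技有限公司" finds nothing.
  4. After removing "北京宝软科技有限公司" from the dictionary and restarting, searching for "北京宝软科技有限公司" still finds nothing.

Here is the analyzer output for "北京宝软科技有限公司"; eyeballing it, I don't see anything wrong: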

{
  "tokens": [
    {
      "token": "北京宝软科技有限公司",
      "start_offset": 0,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "北京",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "宝软科技有限公司",
      "start_offset": 2,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "宝软科技",
      "start_offset": 2,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "宝软",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "科技",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "有限公司",
      "start_offset": 6,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "有限",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "公司",
      "start_offset": 8,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 8
    }
  ]
}
medcl commented 8 years ago

After the dictionary is modified, the index has to be rebuilt.

wuyadong commented 8 years ago
  1. The analysis output above is from before the dictionary change.
  2. So the problem is still that the longest token can't be found?
medcl commented 8 years ago

@wuyadong Please give me a reproduction script.

wuyadong commented 8 years ago

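@medcl See below. I reindexed and tested both cases, with and without "北京宝软科技有限公司" in the dictionary. Without it the document is found; with it the document is not found.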

put index_test

{
    "mappings": {
          "test": {
              "properties": {
                  "name": {
                      "type": "string",
                      "analyzer": "ik_max_word",
                      "search_analyzer": "ik_max_word",
                      "include_in_all": "true"
                  }
              }
          }
    }

}

put index_test/test/1

{
    "name" : "北京宝软科技有限公司"
}

put index_test/test/2

{
    "name" : "宝软科技有限公司"
}

put index_test/test/3

{
    "name" : "宝软科技"
}

put index_test/test/4

{
    "name" : "网易科技"
}

post index_test/test/_search
{
    "query":{
      "bool" : {
        "must" : {
              "match" : {
                "name" : {
                  "query" : "北京宝软科技有限公司",
                  "type" : "phrase"
                }
              }
            }
        }
    }
}
smilesfc commented 8 years ago

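@wuyadong My guess is that at search time the search analyzer does not split the text as 北京宝软科技有限公司 but in some other way, for example 北京/宝软/科技/有限公司. Since the position of 宝软 is 4 and that of 北京 is 0, match_phrase considers the two non-adjacent, so nothing is found. Relaxing slop makes it match.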
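For anyone who wants to try the slop workaround, here is a minimal sketch that builds the request body in Python. `phrase_query` is a hypothetical helper, the field name `name` comes from the test script above, and the slop value of 10 is an arbitrary assumption for illustration, not a recommended setting.

```python
import json


def phrase_query(field: str, text: str, slop: int = 0) -> dict:
    """Build a match_phrase request body with an explicit slop."""
    return {
        "query": {
            "match_phrase": {
                field: {"query": text, "slop": slop}
            }
        }
    }


# ASSUMPTION: slop=10 is only an illustrative value.
body = phrase_query("name", "北京宝软科技有限公司", slop=10)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted into Sense as the body of a `_search` request against the test index. A larger slop also lets looser matches through, so it trades precision for recall.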

medcl commented 8 years ago

@wuyadong Strange, I just can't reproduce it here with your script; everything is found. I'm on 2.3 as well.

wuyadong commented 8 years ago

@medcl Could some other configuration be causing it? I read back my configuration:

mapping

"mappings": {
"test": {
"properties": {
"name": {
"include_in_all": true,
"analyzer": "ik_max_word",
"type": "string"
}
}
}
},

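ik dictionary extension configuration: I kept only the stopword extension and my own dictionary. The dictionary contains 北京宝软科技有限公司, 宝软科技有限公司, 有限公司 and quite a few more.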

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
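    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionary here -->
    <entry key="ext_dict">custom/myself.dic</entry>
    <!-- Users can configure their own extension stopword dictionary here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Users can configure a remote extension stopword dictionary here -->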
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

ES configuration: only the IK analyzer is configured; the rest is local IP settings and the like, which shouldn't have any effect:

index.analysis.analyzer.default.type : ik
wuyadong commented 8 years ago

@medcl Result of searching for 宝软科技有限公司:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.472951,
    "hits": [
      {
        "_index": "index_test",
        "_type": "test",
        "_id": "2",
        "_score": 2.472951,
        "_source": {
          "name": "宝软科技有限公司"
        }
      },
      {
        "_index": "index_test",
        "_type": "test",
        "_id": "1",
        "_score": 0.67124057,
        "_source": {
          "name": "北京宝软科技有限公司"
        }
      }
    ]
  }
}

Result for 北京宝软科技有限公司:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
wuyadong commented 8 years ago

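@medcl Were you able to reproduce it? Pay attention to the dictionary configuration. @smilesfc I don't know the internals, but your guess may be right. Still, 北京宝软科技有限公司 is in the dictionary, so it should be emitted as a single token. Or do I need to specify ik_smart for searching?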

wuyadong commented 8 years ago

@smilesfc @medcl I changed the mapping to specify "search_analyzer": "ik_smart", and now the search finds the document. With ik_max_word it is not found.

{
    "mappings": {
          "test": {
              "properties": {
                  "name": {
                      "type": "string",
                      "analyzer": "ik_max_word",
                      "search_analyzer": "ik_smart",
                      "include_in_all": "true"
                  }
              }
          }
    }                 
}

Search

{"query":
{
  "bool" : {
    "must" : {
          "match" : {
            "name" : {
              "query" : "北京宝软科技有限公司",
              "type" : "phrase"
            }
          }
        }
  }
}
}

Result

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.095891505,
    "hits": [
      {
        "_index": "index_test",
        "_type": "test",
        "_id": "1",
        "_score": 0.095891505,
        "_source": {
          "name": "北京宝软科技有限公司"
        }
      }
    ]
  }
}
wuyadong commented 8 years ago

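@medcl I just ran the script again with ik_max_word, and it works now. I'll rebuild the index and see whether the production database also works. Rebuilding is painful.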

wuyadong commented 8 years ago

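@medcl @smilesfc It really is fixed. Thinking back on what changed recently, the only thing was some electrical repair work, during which the server was restarted once. Leaving this here for the record; anyone who hits this later might try the same...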

smilesfc commented 8 years ago

I still think there is a landmine hidden here. I'll study the tokenizer source code carefully.

medcl commented 7 years ago

Phrase queries use position. They suit cases where the tokens produced by analysis do not overlap in position; if tokens overlap, the slop calculation may go wrong.
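To see how much overlap a given analyzer output contains, one rough check is to look for tokens whose character offsets overlap even though each token was given its own position. The sketch below is only an illustration: `overlapping_pairs` is a hypothetical helper, and the sample is the first three tokens of the `_analyze` output for "北京宝软科技有限公司" posted earlier in this thread.

```python
from typing import Dict, List, Tuple


def overlapping_pairs(tokens: List[Dict]) -> List[Tuple[str, str]]:
    """Return pairs of tokens whose [start_offset, end_offset) ranges overlap."""
    pairs = []
    for i, a in enumerate(tokens):
        for b in tokens[i + 1:]:
            if (a["start_offset"] < b["end_offset"]
                    and b["start_offset"] < a["end_offset"]):
                pairs.append((a["token"], b["token"]))
    return pairs


# First three tokens from the ik_max_word output quoted earlier in the thread.
tokens = [
    {"token": "北京宝软科技有限公司", "start_offset": 0, "end_offset": 10, "position": 0},
    {"token": "北京", "start_offset": 0, "end_offset": 2, "position": 1},
    {"token": "宝软科技有限公司", "start_offset": 2, "end_offset": 10, "position": 2},
]

for left, right in overlapping_pairs(tokens):
    print(left, "overlaps", right)
```

Here the full company name overlaps both of its parts, yet all three tokens sit at different positions, which is the situation described above as a poor fit for phrase queries.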

pengqiuyuan commented 6 years ago

Same problem here. @medcl @smilesfc could you help me figure out why?

curl -XPUT http://127.0.0.1:9200/ikindex2

curl -XPOST http://127.0.0.1:9200/ikindex2/fulltext2/_mapping -d'
{
  "fulltext2": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}'

curl -XPOST http://127.0.0.1:9200/ikindex2/fulltext2/1 -d'
{
  "content": "国家主席习近平和夫人彭丽媛为金砖国家和对话会受邀国领导人夫妇举行欢迎宴会"
}'

curl -XPOST http://127.0.0.1:9200/ikindex2/fulltext2/_search?pretty  -d'
{
  "query": { "match_phrase": { "content": { "query": "金砖国家", "slop": 0, "analyzer": "ik_max_word" } } },
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": {
      "content": {}
    }
  }
}'

curl 'http://127.0.0.1:9200/ikindex2/_analyze?analyzer=ik_max_word&pretty=true' -d '
{
  "text":"国家主席习近平和夫人彭丽媛为金砖国家和对话会受邀国领导人夫妇举行欢迎宴会"
}'

{
  "tokens" : [
    {
      "token" : "国家主席",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "国家",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "家",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "主席",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "习近平",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "平和",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "夫人",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "彭丽媛",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "彭",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "丽",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "媛",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "为",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "CN_CHAR",
      "position" : 11
    },
    {
      "token" : "金砖",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "国家",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "家和",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "家",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "CN_WORD",
      "position" : 15
    },
    {
      "token" : "和",
      "start_offset" : 18,
      "end_offset" : 19,
      "type" : "CN_CHAR",
      "position" : 16
    },
    {
      "token" : "对话会",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "CN_WORD",
      "position" : 17
    },
    {
      "token" : "对话",
      "start_offset" : 19,
      "end_offset" : 21,
      "type" : "CN_WORD",
      "position" : 18
    },
    {
      "token" : "会受",
      "start_offset" : 21,
      "end_offset" : 23,
      "type" : "CN_WORD",
      "position" : 19
    },
    {
      "token" : "受邀",
      "start_offset" : 22,
      "end_offset" : 24,
      "type" : "CN_WORD",
      "position" : 20
    },
    {
      "token" : "邀",
      "start_offset" : 23,
      "end_offset" : 24,
      "type" : "CN_WORD",
      "position" : 21
    },
    {
      "token" : "国",
      "start_offset" : 24,
      "end_offset" : 25,
      "type" : "CN_CHAR",
      "position" : 22
    },
    {
      "token" : "领导人",
      "start_offset" : 25,
      "end_offset" : 28,
      "type" : "CN_WORD",
      "position" : 23
    },
    {
      "token" : "领导",
      "start_offset" : 25,
      "end_offset" : 27,
      "type" : "CN_WORD",
      "position" : 24
    },
    {
      "token" : "人夫",
      "start_offset" : 27,
      "end_offset" : 29,
      "type" : "CN_WORD",
      "position" : 25
    },
    {
      "token" : "夫妇",
      "start_offset" : 28,
      "end_offset" : 30,
      "type" : "CN_WORD",
      "position" : 26
    },
    {
      "token" : "妇",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "CN_WORD",
      "position" : 27
    },
    {
      "token" : "举行",
      "start_offset" : 30,
      "end_offset" : 32,
      "type" : "CN_WORD",
      "position" : 28
    },
    {
      "token" : "欢迎宴会",
      "start_offset" : 32,
      "end_offset" : 36,
      "type" : "CN_WORD",
      "position" : 29
    },
    {
      "token" : "欢迎",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "CN_WORD",
      "position" : 30
    },
    {
      "token" : "宴会",
      "start_offset" : 34,
      "end_offset" : 36,
      "type" : "CN_WORD",
      "position" : 31
    },
    {
      "token" : "宴",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "CN_WORD",
      "position" : 32
    },
    {
      "token" : "会",
      "start_offset" : 35,
      "end_offset" : 36,
      "type" : "CN_CHAR",
      "position" : 33
    }
  ]
}
pengqiuyuan commented 6 years ago

Searching for 金砖国家 returns no results. Setting slop to 1 works. But 金砖 and 国家 are clearly adjacent. @medcl
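One possible explanation, sketched below, is that the query string on its own is tokenized differently from the same characters inside the full sentence. The document positions are copied from the `_analyze` output above; the query tokenizations are assumptions (the thread does not show the `_analyze` result for "金砖国家" alone), and `phrase_matches` is a hypothetical, simplified stand-in for exact phrase matching rather than Lucene's real algorithm.

```python
from typing import Dict, List, Tuple


def phrase_matches(doc_positions: Dict[str, List[int]],
                   query_terms: List[Tuple[str, int]]) -> bool:
    """Simplified exact-phrase check (slop 0) on token positions."""
    first_term, first_qpos = query_terms[0]
    for start in doc_positions.get(first_term, []):
        base = start - first_qpos
        if all(base + qpos in doc_positions.get(term, [])
               for term, qpos in query_terms):
            return True
    return False


# Positions of the relevant terms, copied from the _analyze output above.
doc = {"金砖": [12], "国家": [1, 13], "家和": [14], "家": [2, 15]}

# ASSUMPTION A: the query yields only the two visible words.
query_a = [("金砖", 0), ("国家", 1)]
# ASSUMPTION B: the query also yields a trailing single-character token.
query_b = [("金砖", 0), ("国家", 1), ("家", 2)]

print(phrase_matches(doc, query_a))  # True
print(phrase_matches(doc, query_b))  # False
```

If assumption B holds, the extra token 家 would be expected at position 14 in the document, where the index has 家和 instead (家 is at 15), which would fit with slop 1 being enough to match. Running `_analyze` on the query text alone would confirm or rule this out.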

pengqiuyuan commented 6 years ago

Both es and ik are 5.4.0.