infinilabs / analysis-ik

🚌 The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and OpenSearch, with support for custom dictionaries.
Apache License 2.0

Chinese more_like_this query highlights the wrong terms #1077

Open xuetaofeng opened 2 days ago

xuetaofeng commented 2 days ago

Description

With a Chinese more_like_this query, the highlighted terms are wrong. For example, I query for “项目经理” (project manager), but the highlight in the returned result is: “高<em>级项目经</em>理(”

Steps to reproduce

Create an index that uses ik_smart:

#!/usr/bin/bash

curl -X DELETE "localhost:9201/my_index"
curl -X PUT "localhost:9201/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_smart": {
          "type": "custom",
          "tokenizer": "ik_smart"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_ik_smart",
        "position_increment_gap": 1,
        "term_vector": "with_positions_offsets_payloads"
      }
    }
  }
}
'

Insert the documents:

#!/usr/bin/bash

curl -X POST "localhost:9201/my_index/_doc/1" -H 'Content-Type: application/json' -d @- << 'EOF'
{
  "title": [
    "项目经理",
    "ex Mingyuan - 前任 明源福州 销售负责人(till 06/2019)/ 前任 用友 高级项目经理(till 03/2020)",
    "销售负责人(till 06/2019)/ 前任 用友 高级项目经理(till 03/2020)"
  ]
}
EOF

curl -X POST "localhost:9201/my_index/_doc/2" -H 'Content-Type: application/json' -d @- << 'EOF'
{
  "title": [
    "开发工程师",
    "前任 Google 软件工程师经理",
    "现任 Facebook 高级开发工程师"
  ]
}
EOF

curl -X POST "localhost:9201/my_index/_doc/3" -H 'Content-Type: application/json' -d @- << 'EOF'
{
  "title": [
    "数据分析师",
    "前任 IBM 数据分析师",
    "现任 Amazon 数据科学家",
    "现任 Amazon 项目数据科学家"
  ]
}
EOF

Query with more_like_this and highlight:

#!/usr/bin/bash

curl -X POST "localhost:9201/my_index/_search?pretty" -H 'Content-Type: application/json' -d @- << 'EOF'
{
  "query": {
    "more_like_this": {
      "fields": ["title"],
      "like": "项目经理",
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "analyzer": "my_ik_smart"
    }
  },
  "highlight": {
    "fields": {
      "title": {"type": "fvh", "fragment_size": 150, "number_of_fragments": 3}
    }
  }
}
EOF

Provide your configuration or code snippet that helps.

Expected behavior

I expect “项目经理” to be highlighted.

Actual behavior

The actual result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.3648179,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.3648179,
        "_source" : {
          "title" : [
            "项目经理",
            "ex Mingyuan - 前任 明源福州 销售负责人(till 06/2019)/ 前任 用友 高级项目经理(till 03/2020)",
            "销售负责人(till 06/2019)/ 前任 用友 高级项目经理(till 03/2020)"
          ]
        },
        "highlight" : {
          "title" : [
            "项目经理",
            "ex Mingyuan - 前任 明源福州 销售负责人(till 06/2019)/ 前任 用友 高级项目经理(till 03/2020)",
            "销售负责人(till 06/2019)/ 前任 用友 高<em>级项目经</em>理(till 03/2020)"
          ]
        }
      }
    ]
  }
}
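To make the mismatch easy to check mechanically, here is a minimal sketch that extracts the highlighted span from the third fragment of the response above (with the tags written as `<em>`...`</em>`) and compares it with the queried term. The variable names are illustrative only and are not part of the original report.

```shell
#!/usr/bin/env bash
# Third highlight fragment, copied from the response above.
fragment='销售负责人(till 06/2019)/ 前任 用友 高<em>级项目经</em>理(till 03/2020)'
# The term passed to "like" in the more_like_this query.
expected='项目经理'

# Pull out every <em>...</em> span, then strip the tags themselves.
highlighted=$(printf '%s\n' "$fragment" | grep -o '<em>[^<]*</em>' | sed 's/<[^>]*>//g')

if [ "$highlighted" = "$expected" ]; then
  echo "highlight OK: $highlighted"
else
  echo "highlight mismatch: got '$highlighted', expected '$expected'"
fi
```

In this fragment the highlighted span has the same length as the expected term but starts one character earlier ("级项目经" instead of "项目经理").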

Environment

xuetaofeng commented 1 day ago

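With the smartcn analyzer there is no problem. The problem only occurs with ik_smart and ik_max_word. Note that the highlight problem shows up when the field value is an array.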
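One possible way to narrow this down, sketched under the assumption that the same host, port, and index from the reproduction steps are in use, is to compare the token offsets each analyzer reports for an array input and to look at the term vectors stored for document 1 (the fvh highlighter works from stored term vectors). The helper names `analyze_array` and `stored_offsets` and the `ES` variable are hypothetical; the `_analyze` and `_termvectors` endpoints are standard Elasticsearch APIs, and the `smartcn` analyzer requires the analysis-smartcn plugin.

```shell
#!/usr/bin/env bash
# Assumption: same host/port and index as in the reproduction steps.
ES="${ES:-localhost:9201}"

# Print tokens with start/end offsets for a two-value (array) input.
# $1 is an analyzer name, e.g. my_ik_smart, ik_max_word, or smartcn.
analyze_array() {
  curl -s -X POST "$ES/my_index/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d "{\"analyzer\": \"$1\", \"text\": [\"项目经理\", \"前任 用友 高级项目经理(till 03/2020)\"]}"
}

# Print the term vectors (positions and offsets) stored for document 1.
stored_offsets() {
  curl -s -X POST "$ES/my_index/_termvectors/1?pretty" \
    -H 'Content-Type: application/json' \
    -d '{"fields": ["title"], "offsets": true, "positions": true}'
}
```

Running `analyze_array my_ik_smart` and `analyze_array smartcn` side by side and comparing the `start_offset` and `end_offset` values reported for the second array element may show whether the offsets already differ at analysis time, which would be one hint as to where the shift comes from.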