infinilabs / analysis-ik

🚌 The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and OpenSearch, with support for custom dictionaries.
Apache License 2.0

IK does not tokenize according to the main.dic dictionary: 创立 is already in the dictionary, but ik_smart does not produce it #1060

Open jiankunking opened 4 months ago

jiankunking commented 4 months ago

Description

IK does not tokenize according to the main.dic dictionary. For example, 创立 is already in the dictionary, but ik_smart does not emit it as a token.

Steps to reproduce

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "什么时候创立了公司?"
}

Tokenization result:

{
  "tokens" : [
    {
      "token" : "什么时候",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "创",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "立了",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "公司",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

了 is a stop word, so I don't understand why "立了" is emitted as a token.
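One plausible cause (a toy sketch only, not IK's actual code): if 立了 is also present in the dictionary, then 创立了 contains overlapping candidates 创立 and 立了, and a disambiguation pass that scans from the right, like backward maximum matching, will pick 创 + 立了, while a left-to-right scan picks 创立 + 了. The dictionary below is hypothetical and only illustrates the ambiguity:

```python
# Toy illustration (NOT IK's real algorithm): the same text splits
# differently under forward vs. backward maximum matching when the
# dictionary contains both 创立 and 立了.
DICT = {"什么时候", "创立", "立了", "公司"}  # hypothetical dictionary
MAX_LEN = 4  # longest dictionary entry we try

def forward_mm(text):
    """Forward maximum matching: scan left to right, prefer longest match."""
    out, i = [], 0
    while i < len(text):
        for l in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + l] in DICT or l == 1:  # fall back to single char
                out.append(text[i:i + l])
                i += l
                break
    return out

def backward_mm(text):
    """Backward maximum matching: scan right to left, prefer longest match."""
    out, j = [], len(text)
    while j > 0:
        for l in range(min(MAX_LEN, j), 0, -1):
            if text[j - l:j] in DICT or l == 1:
                out.insert(0, text[j - l:j])
                j -= l
                break
    return out

print(forward_mm("创立了"))   # ['创立', '了']
print(backward_mm("创立了"))  # ['创', '立了']
```

If that is what is happening here, removing 立了 from the dictionary (or adding it to the stop-word list) should let 创立 win, but this would need to be confirmed against the actual dictionary contents.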

Expected behavior

{
  "tokens" : [
    {
      "token" : "什么时候",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "创立",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "公司",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

Environment

kin122 commented 1 month ago

Switch to the ik_max_word mode instead.
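For reference, that would mean changing the analyzer name in the original request; ik_max_word performs the finest-grained segmentation, so 创立 should appear among its tokens (along with overlapping tokens such as 立了, if that entry exists in the dictionary):

```
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "什么时候创立了公司?"
}
```

Note that ik_max_word emits overlapping tokens, which is usually fine for the index-time analyzer but may not be what you want at search time.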