infinilabs / analysis-ik

🚌 The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and OpenSearch and supports custom dictionaries.
Apache License 2.0

ik_max_word splits English text with a trailing period into two tokens #993

Open sissilab opened 1 year ago

sissilab commented 1 year ago

With ik_max_word, the text FoX. is split into two tokens. How can I handle this so that only fox is output?

If I define a custom char_filter of type mapping that maps . to a space, it also affects cases where the . is needed, such as U.F.O, where the . must be kept.

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "FoX."
}

{
  "tokens": [
    {
      "token": "fox.",
      "start_offset": 0,
      "end_offset": 4,
      "type": "LETTER",
      "position": 0
    },
    {
      "token": "fox",
      "start_offset": 0,
      "end_offset": 3,
      "type": "ENGLISH",
      "position": 1
    }
  ]
}
fengshansi commented 4 months ago

Have you solved this problem? I have the same question.

sissilab commented 4 months ago

@fengshansi You could consider using a char_filter to remove the . first and see whether that meets your needs.
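
One possible way to try this suggestion while keeping the periods inside U.F.O is a pattern_replace char_filter that removes a . only when it is not followed by a letter or digit. This is a minimal sketch and not a confirmed fix from the plugin maintainers: the regular expression and the sample text are assumptions chosen for illustration, and the resulting tokens should be checked on your own cluster.

```
# Assumption: the regex below is illustrative, not part of the IK plugin.
# It deletes a "." that is not immediately followed by a letter or digit,
# so the trailing period in "FoX." is removed while the inner periods
# in "U.F.O" are left in place before the ik_max_word tokenizer runs.
GET _analyze
{
  "tokenizer": "ik_max_word",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\.(?![A-Za-z0-9])",
      "replacement": ""
    }
  ],
  "text": "FoX. U.F.O"
}
```

If the output looks right, the same char_filter definition can be placed in the index's analysis settings inside a custom analyzer that uses the ik_max_word tokenizer. Note that this pattern would also strip the final period of an abbreviation written as U.F.O., so it may need adjusting for such input.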