infinilabs / analysis-ik

🚌 The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and OpenSearch, with support for customized dictionaries.
Apache License 2.0

Non-ASCII characters are silently dropped #994

Open sissilab opened 1 year ago

sissilab commented 1 year ago

When using ik for analysis, non-ASCII characters are silently dropped, as in the following example with açaí à. How can this be handled?

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "açaí à la carte"
}

{
  "tokens": [
    {
      "token": "la",
      "start_offset": 7,
      "end_offset": 9,
      "type": "ENGLISH",
      "position": 0
    },
    {
      "token": "carte",
      "start_offset": 10,
      "end_offset": 15,
      "type": "ENGLISH",
      "position": 1
    }
  ]
}

For comparison, here is the same text run through the asciifolding filter, which converts such characters to their ASCII equivalents:

GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["asciifolding"],
  "text" : "açaí à la carte"
}

{
  "tokens": [
    {
      "token": "acai",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "a",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "la",
      "start_offset": 7,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "carte",
      "start_offset": 10,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
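One possible workaround, sketched here as an assumption rather than a confirmed fix from this thread: token filters such as asciifolding run after tokenization, so they cannot restore characters the ik tokenizer has already dropped. A `mapping` char filter, however, runs before the tokenizer, so the accented characters can be folded to ASCII first. The index name `my_index`, the analyzer name `ik_folded`, and the (deliberately incomplete) mappings list are all illustrative:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "fold_accents": {
          "type": "mapping",
          "mappings": ["ç => c", "í => i", "à => a"]
        }
      },
      "analyzer": {
        "ik_folded": {
          "type": "custom",
          "char_filter": ["fold_accents"],
          "tokenizer": "ik_max_word"
        }
      }
    }
  }
}
```

With this in place, `GET /my_index/_analyze` with `"analyzer": "ik_folded"` should see `acai a la carte` instead of losing the accented words. Maintaining an explicit mappings list is tedious; whether a broader normalization (e.g. via the analysis-icu plugin's char filters) can fold accents ahead of IK would need to be verified separately.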