infinilabs / analysis-pinyin

🛵 This Pinyin Analysis plugin converts between Chinese characters and Pinyin.
Apache License 2.0

Single-letter tokens in the output: e.g. "我" is tokenized into both 'w' and 'wo', but only 'wo' is wanted #242

Open · Molerni opened this issue 4 years ago

Molerni commented 4 years ago

GET /pmall_goods_v2/_analyze
{
  "analyzer": "pinyin",
  "text": ["我"]
}

Result:

{
  "tokens" : [
    {
      "token" : "wo",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "w",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    }
  ]
}

I actually only want the single 'wo' token. Is there any way to do that?

teaGod-s commented 4 years ago

+1

jayqian commented 3 years ago

Add the official length token filter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-length-tokenfilter.html) to drop terms of length 1 from the token stream.
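A minimal sketch of that approach (the index name pinyin_demo and the filter/analyzer names below are illustrative; length and its min parameter are the standard Elasticsearch token filter options):

PUT /pinyin_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "drop_single_char": {
          "type": "length",
          "min": 2
        }
      },
      "analyzer": {
        "pinyin_min2": {
          "tokenizer": "pinyin",
          "filter": ["drop_single_char"]
        }
      }
    }
  }
}

GET /pinyin_demo/_analyze
{
  "analyzer": "pinyin_min2",
  "text": ["我"]
}

With this analyzer the single-letter token 'w' should be dropped, leaving only 'wo'.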

ljj6218 commented 2 years ago

keep_first_letter: when enabled, e.g. 刘德华 > ldh; default: true
keep_separate_first_letter: when enabled, keeps the first letters as separate tokens, e.g. 刘德华 > l, d, h; default: false. Note: query results may become too fuzzy, since these terms are too frequent
limit_first_letter_length: sets the maximum length of the first_letter result; default: 16
keep_full_pinyin: when enabled, e.g. 刘德华 > [liu, de, hua]; default: true
keep_joined_full_pinyin: when enabled, e.g. 刘德华 > [liudehua]; default: false
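Based on those options, here is a sketch of a custom pinyin tokenizer that suppresses the first-letter tokens entirely (the index, tokenizer, and analyzer names are illustrative; the parameters are the plugin options listed above):

PUT /pinyin_opts_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "full_pinyin_only": {
          "type": "pinyin",
          "keep_first_letter": false,
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_joined_full_pinyin": false
        }
      },
      "analyzer": {
        "full_pinyin_analyzer": {
          "tokenizer": "full_pinyin_only"
        }
      }
    }
  }
}

With these settings, 刘德华 should analyze to liu, de, hua only, and 我 to wo.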

shaunhurryup commented 9 months ago

keep_separate_first_letter doesn't seem to take effect: I set it to false (which is also its documented default), yet tokens like l, d, h are still produced.

Request

POST /_analyze
{
  "tokenizer": "pinyin",
  "text": "刘德华",
  "filter": [
    {
      "type": "pinyin",
      "keep_separate_first_letter": false
    }
  ]
}

Response

{
  "tokens": [
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "l",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 1
    },
    {
      "token": "d",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 2
    },
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 3
    },
    {
      "token": "ldh",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 3
    },
    {
      "token": "de",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 4
    },
    {
      "token": "hua",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 5
    }
  ]
}
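One thing worth noting about the request above: it runs pinyin twice, once as the tokenizer and once as a token filter, so the stray l, d, h tokens may come from the filter re-processing tokens such as ldh, rather than from the option being ignored. A quick way to narrow this down is to test the bare tokenizer with no filter:

POST /_analyze
{
  "tokenizer": "pinyin",
  "text": "刘德华"
}

If the single letters are absent here, the likely fix is to set the keep_* options on a custom tokenizer or filter definition in the index settings (as in the earlier sketch) instead of stacking a second pinyin filter on top of the pinyin tokenizer.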