infinilabs / analysis-pinyin

🛵 This Pinyin Analysis plugin is used to do conversion between Chinese characters and Pinyin.
Apache License 2.0

First-letter search: mec cannot find 木耳草 #247

Open vancefantasy opened 3 years ago

vancefantasy commented 3 years ago

Index configuration

"analyzer": {
  "pinyin_analyzer": {
    "tokenizer": "my_pinyin"
  }
}

"tokenizer": {
  "my_pinyin": {
    "type": "pinyin",
    "lowercase": "true",
    "keep_original": "false",
    "keep_first_letter": "true",
    "keep_separate_first_letter": "true",
    "limit_first_letter_length": "64",
    "keep_full_pinyin": "true"
  }
}

"properties": {
  "name": {
    "type": "keyword",
    "fields": {
      "py": {
        "type": "text",
        "analyzer": "pinyin_analyzer",
        "search_analyzer": "pinyin_analyzer"
      }
    }
  }
}

Index time (木耳草)

{
"tokens": [
    {
        "token": "m",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 0
    },
    {
        "token": "mu",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 0
    },
    {
        "token": "e",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 1
    },
    {
        "token": "er",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 1
    },
    {
        "token": "c",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 2
    },
    {
        "token": "cao",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 2
    },
    {
        "token": "mec",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 2
    }
]}

Search time (mec)

{
"tokens": [
    {
        "token": "me",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 0
    },
    {
        "token": "c",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 1
    },
    {
        "token": "mec",
        "start_offset": 0,
        "end_offset": 0,
        "type": "word",
        "position": 1
    }
] }

At search time, the tokens produced for mec include me, so a phrase query returns nothing. Is there a solution?
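To make the failure concrete, here is a minimal Python sketch of position-based phrase matching, using the two token lists posted above. It is a simplified model and not Lucene's actual implementation: it assumes that terms sharing a query position are treated as alternatives and that consecutive query positions must align with consecutive index positions. The function names are made up for illustration.

```python
# Tokens copied from the _analyze output in this issue: (term, position).
INDEX_TOKENS = [("m", 0), ("mu", 0), ("e", 1), ("er", 1),
                ("c", 2), ("cao", 2), ("mec", 2)]
QUERY_TOKENS = [("me", 0), ("c", 1), ("mec", 1)]


def group_by_position(tokens):
    """Map each position to the set of terms emitted at that position."""
    grouped = {}
    for term, pos in tokens:
        grouped.setdefault(pos, set()).add(term)
    return grouped


def phrase_matches(index_tokens, query_tokens):
    """Simplified phrase match (an assumption, not Lucene's real code):
    every query position must share a term with the index position
    at the same relative offset."""
    index = group_by_position(index_tokens)
    query = group_by_position(query_tokens)
    first = min(query)
    for start in index:
        if all(index.get(start + (qpos - first), set()) & terms
               for qpos, terms in query.items()):
            return True
    return False


print(phrase_matches(INDEX_TOKENS, QUERY_TOKENS))  # False: "me" is never indexed
```

Under this model the phrase fails at its first position, because the search-time token me has no counterpart among the indexed terms, no matter what follows it.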

medcl commented 3 years ago

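When pinyin produces multiple duplicate terms at overlapping positions, it is inherently unsuitable for phrase queries. A regular (non-phrase) query should work: both the query and the index produce the term mec, so the document should be found.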
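The point about the shared term can be checked directly against the two token lists from the issue. This is only a set-intersection sketch of a non-phrase query that ignores positions and scoring; the variable names are illustrative.

```python
# Terms copied from the index-time and search-time _analyze output above.
index_terms = {"m", "mu", "e", "er", "c", "cao", "mec"}
query_terms = {"me", "c", "mec"}

# A non-phrase query only needs some query term to exist in the document.
shared = sorted(index_terms & query_terms)
print(shared)  # ['c', 'mec']
```

Besides mec, the single letter c is also shared, so any document containing a character whose pinyin starts with c would overlap with this query as well.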

vancefantasy commented 3 years ago

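@medcl Thanks for the reply. After replacing phrase with best_fields, the hit range is a bit too broad and some unrelated results show up. If I set the search analyzer to keyword_analyzer, the document is found, which solves the current case but introduces other problems: for example, searching muer no longer works. This is a bit tricky.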
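Both effects described here can be reproduced with a small Python sketch. It assumes that a non-phrase query matches on any shared term and that a keyword-style search analyzer emits the whole input as a single term. The second document (草) and its tokens are hypothetical, added only to illustrate an unrelated hit.

```python
# Document name -> indexed terms.
DOCS = {
    "木耳草": {"m", "mu", "e", "er", "c", "cao", "mec"},  # tokens from this issue
    "草": {"c", "cao"},  # hypothetical unrelated document, tokens assumed
}


def or_match(query_terms):
    """Non-phrase match: a document hits if it shares any term with the query."""
    return [name for name, terms in DOCS.items() if terms & query_terms]


def single_term_match(text):
    """Keyword-style search analysis: the entire input is one term."""
    return [name for name, terms in DOCS.items() if text in terms]


print(or_match({"me", "c", "mec"}))  # ['木耳草', '草']
print(single_term_match("mec"))      # ['木耳草']
print(single_term_match("muer"))     # []
```

In this model the lone term c drags in the unrelated document, while the single-term search finds mec but misses muer, because muer is not an indexed term under the configuration posted above.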

yanjiali2020 commented 1 year ago

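I used medcl3 from the examples:

POST /medcl3/_doc/lucy
{"name":"敏感的心"}

When I search for mingan, the query is split as ming/an, so min/gan is never hit, even though the indexed tokens do contain min and gan. Searching for mg works. How can this be solved?

GET /medcl3/_validate/query?explain
{ "query": {"match": { "name.pinyin": "mingan" }} }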
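The ambiguity behind this report can be sketched in Python: the string mingan has two valid syllable splits, and a splitter that always takes the longest syllable first picks ming/an. Whether the plugin actually splits this way is an assumption that is not confirmed in this thread, and the syllable table below is a deliberately tiny, hypothetical one.

```python
# Hypothetical, tiny syllable table; a real pinyin dictionary is much larger.
SYLLABLES = {"min", "ming", "gan", "an"}


def segmentations(text):
    """Return every way to split text into syllables from SYLLABLES."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in SYLLABLES:
            results.extend([head] + rest for rest in segmentations(text[end:]))
    return results


def greedy_longest(text):
    """Split by always taking the longest known syllable; None if stuck."""
    out = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in SYLLABLES:
                out.append(text[:end])
                text = text[end:]
                break
        else:
            return None
    return out


print(segmentations("mingan"))   # [['min', 'gan'], ['ming', 'an']]
print(greedy_longest("mingan"))  # ['ming', 'an']
```

If the search side commits to ming/an while the document was indexed as min/gan, the full-pinyin terms never overlap, which is consistent with the behaviour reported here. The first-letter form mg does not depend on syllable boundaries, which would explain why that search still works.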