infinilabs / analysis-pinyin

🛵 This Pinyin Analysis plugin is used to do conversion between Chinese characters and Pinyin.
Apache License 2.0
2.94k stars, 547 forks

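Pinyin first-letter query problem: no results when the second character's pinyin first letter is the same as the first character's final (vowel part) #293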

Open Jiangtao976 opened 11 months ago

Jiangtao976 commented 11 months ago

{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "default_pipeline": "biz_timestamp_pipeline",
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": { "tokenizer": "my_pinyin" }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": true,
          "keep_full_pinyin": true,
          "keep_joined_full_pinyin": false,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true,
          "ignore_pinyin_offset": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "vendorName": {
        "type": "text",
        "analyzer": "pinyin_analyzer",
        "search_analyzer": "pinyin_analyzer",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}

Example 1:
Chinese text: 刘德华阿里巴巴
Analysis result:
{
  "tokens": [
    { "token": "l", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "liu", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "刘德华阿里巴巴", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 },
    { "token": "ldhalbb", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 },
    { "token": "d", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "de", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "h", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "hua", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "a", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "li", "start_offset": 4, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "b", "start_offset": 5, "end_offset": 6, "type": "word", "position": 5 },
    { "token": "ba", "start_offset": 5, "end_offset": 6, "type": "word", "position": 5 }
  ]
}

Query:
{ "query": { "match_phrase": { "vendorName": { "query": "ldha" } } } }

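As you can see, the analysis result contains the first letters ldha, yet the query returns nothing. My guess is that the first letter a of "阿" is being affected by the a in "华" (hua), which is why nothing is found.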
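One way to check this guess is to ask Elasticsearch how the search analyzer tokenizes the query string itself, using the `_analyze` API, and compare those tokens with the indexed ones. The following is only a minimal sketch: the cluster URL `http://localhost:9200` and the index name `vendor_index` are hypothetical placeholders, not values from this issue.

```python
import json
import urllib.request

# Hypothetical values, not taken from the issue: adjust to your cluster and index.
ES_URL = "http://localhost:9200"
INDEX = "vendor_index"


def build_analyze_request(text, analyzer="pinyin_analyzer"):
    """Return (url, body_bytes) for an index-level _analyze call."""
    url = f"{ES_URL}/{INDEX}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return url, body


def analyze(text, analyzer="pinyin_analyzer"):
    """Send the request and return a list of (token, position) pairs.

    Requires a running cluster with the index settings shown in this issue.
    """
    url, body = build_analyze_request(text, analyzer)
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        tokens = json.load(resp)["tokens"]
    return [(t["token"], t["position"]) for t in tokens]


# Example usage against a live cluster:
#   analyze("ldha")            # tokens produced for the query string
#   analyze("刘德华阿里巴巴")    # tokens produced at index time
```

If the tokens returned for "ldha" are not all present at consecutive positions in the indexed output, `match_phrase` has nothing to match.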

Example 2:
Chinese text: 深圳健安医药有限公司
Analysis result:
{
  "tokens": [
    { "token": "s", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "shen", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "深圳健安医药有限公司", "start_offset": 0, "end_offset": 10, "type": "word", "position": 0 },
    { "token": "szjayyyxgs", "start_offset": 0, "end_offset": 10, "type": "word", "position": 0 },
    { "token": "z", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "zhen", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "j", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "jian", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "a", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "an", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "y", "start_offset": 4, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "yi", "start_offset": 4, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "yao", "start_offset": 5, "end_offset": 6, "type": "word", "position": 5 },
    { "token": "you", "start_offset": 6, "end_offset": 7, "type": "word", "position": 6 },
    { "token": "x", "start_offset": 7, "end_offset": 8, "type": "word", "position": 7 },
    { "token": "xian", "start_offset": 7, "end_offset": 8, "type": "word", "position": 7 },
    { "token": "g", "start_offset": 8, "end_offset": 9, "type": "word", "position": 8 },
    { "token": "gong", "start_offset": 8, "end_offset": 9, "type": "word", "position": 8 },
    { "token": "si", "start_offset": 9, "end_offset": 10, "type": "word", "position": 9 }
  ]
}

Query:
{ "query": { "match_phrase": { "vendorName": { "query": "szja" } } } }

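As you can see, the analysis result contains the first letters szja, yet the query returns nothing. My guess is that the first letter a of "安" is being affected by the a in "健" (jian), which is why nothing is found.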

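Other Chinese text behaves the same way. For example, 深圳恩 cannot be found with sze either: the first letter e of 恩 seems to be affected by the e in 深 (shen).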

I have tried many parameter combinations and still cannot solve this. Can anyone help?

xiaoshi2013 commented 6 months ago

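Quoting the original post:

Query: { "query": { "match_phrase": { "vendorName": { "query": "ldha" } } } }

As you can see, the analysis result contains the first letters ldha, yet the query returns nothing. My guess is that the first letter a of "阿" is being affected by the a in "华" (hua), which is why nothing is found.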

The analysis result does not contain ldha as a single term, so the query cannot match it. Search for liudehua instead and it works.
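To make that point concrete, here is a small sketch of phrase matching over the tokens listed in Example 1. It is a deliberately simplified model (terms must occur at consecutive positions, in order) and not Elasticsearch's actual implementation. It also takes the query terms as given, because how the pinyin tokenizer splits the query string "ldha" is not shown in this thread.

```python
from collections import defaultdict

# (term, position) pairs copied from the analysis output for 刘德华阿里巴巴.
INDEXED = [
    ("l", 0), ("liu", 0), ("刘德华阿里巴巴", 0), ("ldhalbb", 0),
    ("d", 1), ("de", 1), ("h", 2), ("hua", 2),
    ("a", 3), ("li", 4), ("b", 5), ("ba", 5),
]


def build_postings(tokens):
    """Map each term to the set of positions where it occurs."""
    postings = defaultdict(set)
    for term, pos in tokens:
        postings[term].add(pos)
    return postings


def phrase_match(postings, query_terms):
    """True if the terms occur at consecutive positions, in order."""
    if not query_terms or any(t not in postings for t in query_terms):
        return False
    return any(
        all(start + i in postings[t] for i, t in enumerate(query_terms))
        for start in postings[query_terms[0]]
    )


postings = build_postings(INDEXED)

# "ldha" as one term is not in the index, so it cannot match.
print(phrase_match(postings, ["ldha"]))               # False
# Full pinyin syllables sit at positions 0, 1, 2, so they match.
print(phrase_match(postings, ["liu", "de", "hua"]))   # True
# Four single letters would also match, if the query were split that way.
print(phrase_match(postings, ["l", "d", "h", "a"]))   # True
# "ldha" is only a prefix of the joined first-letter term "ldhalbb".
print(any(term.startswith("ldha") for term in postings))  # True
```

In this model the result depends entirely on which terms the query is split into, so comparing the tokens the search analyzer produces for "ldha" against the indexed tokens is the useful next step.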