infinilabs / analysis-pinyin

🛵 This Pinyin Analysis plugin is used to do conversion between Chinese characters and Pinyin.
Apache License 2.0

Problem combining keep_first_letter with phrase queries #119

Open guoxijun opened 7 years ago

guoxijun commented 7 years ago

Analyzer test result: http://192.168.36.140:9200/1/_analyze?text=周大福&analyzer=pinyin_analyzer

tokens
token "zhou" start_offset 0 end_offset 1 type "word" position 0
token "zdf" start_offset 0 end_offset 3 type "word" position 0
token "da" start_offset 1 end_offset 2 type "word" position 1
token "fu" start_offset 2 end_offset 3 type "word" position 2

Query: curl -XGET 'http://192.168.36.140:9200/1/users/_search?pretty' -d '{

"query": {
    "bool":{
        "must":{
            "multi_match":{
                "query": "zdf",
                "fields":["realName","realName.pinyin"],
                "type": "phrase"
            }
        }
    }
}

}'

This returns no results, but searching for zhoudafu, zhou, dafu and so on does return hits. Why is that?
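
A phrase query matches only when every token the analyzer produces from the query string is found in the field at the same relative positions, so the positions in the _analyze output matter as much as the terms. The Python sketch below is a simplified model of that rule (slop 0, one query token per position), not Elasticsearch or plugin code. The indexed tokens are copied from the output in this issue; the query-side token lists are assumptions for illustration, because the thread does not show how pinyin_analyzer tokenizes the string zdf at search time.

```python
from collections import defaultdict

# (term, position) pairs copied from the _analyze output reported in this issue.
INDEXED = [("zhou", 0), ("zdf", 0), ("da", 1), ("fu", 2)]


def build_positions(tokens):
    """Map each term to the set of positions at which it occurs."""
    positions = defaultdict(set)
    for term, pos in tokens:
        positions[term].add(pos)
    return positions


def phrase_matches(index_tokens, query_tokens):
    """Return True if all query terms occur with the same relative positions (slop 0)."""
    if not query_tokens:
        return False
    positions = build_positions(index_tokens)
    first_term, first_pos = query_tokens[0]
    for start in positions.get(first_term, ()):
        if all(
            start + (pos - first_pos) in positions.get(term, ())
            for term, pos in query_tokens
        ):
            return True
    return False


# Assumed query-side tokenizations (hypothetical, not taken from the plugin):
print(phrase_matches(INDEXED, [("zdf", 0)]))                          # True
print(phrase_matches(INDEXED, [("zhou", 0), ("da", 1), ("fu", 2)]))   # True
print(phrase_matches(INDEXED, [("z", 0), ("d", 1), ("f", 2)]))        # False
```

If the query string comes out of the analyzer as a single zdf token, this model says the phrase should match; if it is split into other tokens, it will not. Running _analyze with text=zdf against the same analyzer would show the real query-side tokens to plug in.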

medcl commented 7 years ago

Which version are you using?

guoxijun commented 7 years ago

The latest, 5.4.0.

guoxijun commented 7 years ago

I also defined another analyzer in the same way: "mobile_tokenizer" : { "type" : "nGram", "min_gram" : 3, "max_gram" : 20, "token_chars" : ["letter","digit"] }

The mapping is: "phones": { "type": "string", "analyzer": "mobile_analyzer" }

The analyzer test output does contain 2219:
token "2219" start_offset 3 end_offset 7 type "word" position 25

But when I search: curl -XGET 'http://192.168.36.140:9200/1/users/_search?pretty' -d '{

"query": {
    "bool":{  
        "must":{
            "multi_match":{
                "query": "2219",
                "fields":["phones"],
                "type": "phrase"
            }
        }
    }
}

}'
{ "took" : 0, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : null, "hits" : [ ] } }
This returns nothing either. I am using 5.4.0 for everything.
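
One possible reading of the empty result: with no separate search analyzer, the query string 2219 is itself passed through the nGram tokenizer, and the phrase query then requires all resulting grams at consecutive positions. The Python sketch below models this under stated assumptions: grams are emitted ordered by start offset and then length, one position per gram (consistent with the position 25 reported above), the input is all digits so token_chars is ignored, and the phone number 13822190000 is hypothetical, chosen only so that 2219 starts at offset 3.

```python
def ngram_tokens(text, min_gram=3, max_gram=20):
    """Emit (term, position) pairs ordered by start offset, then length."""
    tokens = []
    position = 0
    for start in range(len(text)):
        for length in range(min_gram, max_gram + 1):
            end = start + length
            if end > len(text):
                break
            tokens.append((text[start:end], position))
            position += 1
    return tokens


def phrase_matches(index_tokens, query_tokens):
    """Return True if all query terms occur with the same relative positions (slop 0)."""
    if not query_tokens:
        return False
    positions = {}
    for term, pos in index_tokens:
        positions.setdefault(term, set()).add(pos)
    first_term, first_pos = query_tokens[0]
    for start in positions.get(first_term, ()):
        if all(
            start + (pos - first_pos) in positions.get(term, ())
            for term, pos in query_tokens
        ):
            return True
    return False


indexed = ngram_tokens("13822190000")   # hypothetical phone number
analyzed_query = ngram_tokens("2219")   # query run through the same tokenizer

print(analyzed_query)                             # [('221', 0), ('2219', 1), ('219', 2)]
print(phrase_matches(indexed, analyzed_query))    # False
print(phrase_matches(indexed, [("2219", 0)]))     # True
```

In this model the grams 221 and 2219 sit next to each other in the index, but 219 is several positions later, so the three-gram phrase cannot match, while a single unsplit 2219 term does. A commonly used way to keep the query from being split is a separate search_analyzer on the field; whether that suits this mapping is an assumption, not something confirmed in the thread.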

guoxijun commented 7 years ago

Here is my full setup: curl -XPUT http://192.168.36.140:9200/1/ -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": { "tokenizer": "pinyin_tokenizer" },
        "email_analyzer": { "tokenizer": "email_tokenizer", "char_filter": ["email_char_filter"] },
        "mobile_analyzer": { "tokenizer": "mobile_tokenizer" }
      },
      "tokenizer": {
        "pinyin_tokenizer": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_separate_first_letter": true,
          "keep_full_pinyin": true,
          "keep_original": false,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "keep_joined_full_pinyin": true
        },
        "email_tokenizer": { "type": "nGram", "min_gram": 1, "max_gram": 20, "token_chars": ["letter"] },
        "mobile_tokenizer": { "type": "nGram", "min_gram": 3, "max_gram": 20, "token_chars": ["digit"] }
      },
      "char_filter": {
        "email_char_filter": { "type": "pattern_replace", "pattern": "(@.*)", "replacement": "" }
      }
    }
  },
  "mappings": {
    "users": {
      "properties": {
        "realName": {
          "type": "keyword",
          "fields": {
            "pinyin": { "type": "text", "store": "no", "term_vector": "with_positions_offsets", "analyzer": "pinyin_analyzer", "boost": 10 }
          }
        },
        "emails": { "type": "string", "analyzer": "email_analyzer" },
        "phones": { "type": "string", "analyzer": "mobile_analyzer" }
      }
    },
    "depts": {
      "properties": {
        "name": {
          "type": "keyword",
          "fields": {
            "pinyin": { "type": "text", "store": "no", "term_vector": "with_positions_offsets", "analyzer": "pinyin_analyzer", "boost": 10 }
          }
        }
      }
    },
    "groups": {
      "properties": {
        "name": {
          "type": "keyword",
          "fields": {
            "pinyin": { "type": "text", "store": "no", "term_vector": "with_positions_offsets", "analyzer": "pinyin_analyzer", "boost": 10 }
          }
        }
      }
    }
  }
}'

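If I search without phrase, the results are fine: curl -XGET 'http://192.168.36.140:9200/1/users/_search?pretty' -d '{ "query": { "bool":{
"must":{ "multi_match":{ "query": "2219", "fields":["realName","realName.pinyin","phones","emails"] } } } } }'
But searching for zdf this way gives inaccurate results. If I add the phrase type, the zdf search is accurate, but 2219 can no longer be found. I hope that makes clear what I mean.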