infinilabs / analysis-ik

🚌 The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and OpenSearch, with support for custom dictionaries.
Apache License 2.0

English numerals are tagged as CN_WORD #1046

Open stormyi opened 6 months ago

stormyi commented 6 months ago

Description

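With ik_smart (v7.10.0), the tokenization of an English numeral followed by a Chinese measure word (for example "7天", "7 days") does not match what I expect. In particular, I cannot understand why the 7 in "7天" is typed as CN_WORD. (PS: in the same environment, with ES 7.10.2 + IK 7.10.2, an English numeral followed by a Chinese measure word is typed as TYPE_CQUAN.)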

Steps to reproduce

POST /_analyze '{"field":"content","analyzer":"ik_smart","text":"7天 44天 55天"}'
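For anyone reproducing this from a script rather than a console, the request above can be sketched in Python using only the standard library. The base URL `http://localhost:9200` and the helper names `build_analyze_request` and `analyze` are assumptions for illustration, not part of the report; the request body is copied from the step above unchanged.

```python
import json
import urllib.request


def build_analyze_request(base_url, text, analyzer="ik_smart", field="content"):
    """Build (but do not send) the POST /_analyze request from the report."""
    # Body mirrors the one in the reproduction step above.
    body = {"field": field, "analyzer": analyzer, "text": text}
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, text):
    """Send the request and return the list under the "tokens" key."""
    req = build_analyze_request(base_url, text)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["tokens"]


# Hypothetical usage against a local node (assumed address):
# for tok in analyze("http://localhost:9200", "7天 44天 55天"):
#     print(tok["token"], tok["type"])
```

Building the request separately from sending it makes it possible to inspect the payload without a running cluster.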

Expected behavior

"7天" should be a single TYPE_CQUAN token: { "token": "7天", "start_offset": 0, "end_offset": 2, "type": "TYPE_CQUAN", "position": 0 }

Actual behavior

{ "token": "7", "start_offset": 0, "end_offset": 1, "type": "CN_WORD", "position": 0 }, { "token": "天", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 }
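To spot this symptom in a longer token list, a small check along these lines may help. It is only a sketch: the helper name `find_mistyped_numbers` is hypothetical, and the rule it applies (a token made up entirely of ASCII digits should not be typed CN_WORD) is an assumption drawn from this report, not documented IK behavior. The sample tokens are the ones quoted in this issue.

```python
import re

# Matches tokens consisting only of ASCII digits, e.g. "7" or "44".
ASCII_DIGITS = re.compile(r"[0-9]+")


def find_mistyped_numbers(tokens):
    """Return tokens that are pure digits but carry the CN_WORD type."""
    return [
        t for t in tokens
        if ASCII_DIGITS.fullmatch(t["token"]) and t["type"] == "CN_WORD"
    ]


# Tokens from the "Actual behavior" output in this issue.
actual = [
    {"token": "7", "start_offset": 0, "end_offset": 1, "type": "CN_WORD", "position": 0},
    {"token": "天", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1},
]

# Token from the "Expected behavior" output in this issue.
expected = [
    {"token": "7天", "start_offset": 0, "end_offset": 2, "type": "TYPE_CQUAN", "position": 0},
]

suspicious = find_mistyped_numbers(actual)
```

Here `suspicious` holds the "7" token from the actual output, while the same check on `expected` returns an empty list.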

Environment

stormyi commented 6 months ago

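By my expectation, "7天" should be a single token rather than being split into "7" and "天". It looks as though "7" is treated as a CN_WORD and is therefore not combined with "天". Can anyone explain this?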

stormyi commented 6 months ago

@medcl

medcl commented 6 months ago

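Yes, this does need improvement. The IK project will be sorting out its backlog of bugs in the near future, and we will keep improving it.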