infinilabs / analysis-ik

🚌 The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and OpenSearch, and supports customized dictionaries.
Apache License 2.0

The IK analyzer produces incorrect offsets when processing Chinese text #1022

Open DemosHume opened 11 months ago

DemosHume commented 11 months ago

I ran into a problem while using the IK analyzer to process Chinese text.

I have a field recommend_tags whose value is "贝尔法斯特号". When I try to insert this record into my index, I get an error: the startOffset must be non-negative, the endOffset must be >= startOffset, and offsets must not go backwards.

The error output is as follows: ('1 document(s) failed to index.', [{'index': {'_index': 'image_test_6_8_0', '_type': 'sql_record', '_id': 'WcgY0IoB7W0KhcCALXYf', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=2,endOffset=3,lastStartOffset=3 for field 'recommend_tags'"}, 'data': {'recommend_tags': '贝尔法斯特号'}}}])

When I analyze the text manually with the analyze API, the problem seems to lie in the "法" and "斯" tokens: "法" has a startOffset of 2 and an endOffset of 3, but it is emitted after "斯", whose startOffset is 3, which violates the rule that offsets must not go backwards. This is the analysis result: {'tokens': [{'token': '贝尔法斯特', 'start_offset': 0, 'end_offset': 5, 'type': 'CN_WORD', 'position': 0}, {'token': '贝尔法', 'start_offset': 0, 'end_offset': 3, 'type': 'CN_WORD', 'position': 1}, {'token': '贝尔', 'start_offset': 0, 'end_offset': 2, 'type': 'CN_WORD', 'position': 2}, {'token': '斯', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 3}, {'token': '法', 'start_offset': 2, 'end_offset': 3, 'type': 'CN_CHAR', 'position': 4}, {'token': '斯', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 5}, {'token': '特号', 'start_offset': 4, 'end_offset': 6, 'type': 'CN_WORD', 'position': 6}]}

The IK analyzer version information is: description=IK Analyzer for Elasticsearch version=6.8.0

The index field mapping is:

"recommend_tags": { "type": "text", "analyzer": "ik_max_word" }

DemosHume commented 11 months ago

The term 萨尔瓦多共和国 also triggers the problem: {'tokens': [{'token': '萨尔瓦多', 'start_offset': 0, 'end_offset': 4, 'type': 'CN_WORD', 'position': 0}, {'token': '萨尔瓦', 'start_offset': 0, 'end_offset': 3, 'type': 'CN_WORD', 'position': 1}, {'token': '萨尔', 'start_offset': 0, 'end_offset': 2, 'type': 'CN_WORD', 'position': 2}, {'token': '瓦', 'start_offset': 2, 'end_offset': 3, 'type': 'CN_CHAR', 'position': 3}, {'token': '多', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 4}]}

lizongbo commented 11 months ago

Verified as working correctly on ES 8.10.2.


AnalyzeRequest: POST /_analyze {"analyzer":"ik_max_word","text":["贝尔法斯特"]}
AnalyzeResponse: {"tokens":[{"end_offset":5,"position":0,"start_offset":0,"token":"贝尔法斯特","type":"CN_WORD"},{"end_offset":3,"position":1,"start_offset":0,"token":"贝尔法","type":"CN_WORD"},{"end_offset":2,"position":2,"start_offset":0,"token":"贝尔","type":"CN_WORD"},{"end_offset":3,"position":3,"start_offset":2,"token":"法","type":"CN_CHAR"},{"end_offset":4,"position":4,"start_offset":3,"token":"斯","type":"CN_CHAR"},{"end_offset":5,"position":5,"start_offset":4,"token":"特","type":"CN_CHAR"}]}

AnalyzeRequest: POST /_analyze {"analyzer":"ik_max_word","text":["萨尔瓦多"]}
AnalyzeResponse: {"tokens":[{"end_offset":4,"position":0,"start_offset":0,"token":"萨尔瓦多","type":"CN_WORD"},{"end_offset":3,"position":1,"start_offset":0,"token":"萨尔瓦","type":"CN_WORD"},{"end_offset":2,"position":2,"start_offset":0,"token":"萨尔","type":"CN_WORD"},{"end_offset":3,"position":3,"start_offset":2,"token":"瓦","type":"CN_CHAR"},{"end_offset":4,"position":4,"start_offset":3,"token":"多","type":"CN_CHAR"}]}
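The requests above can also be issued from a script to compare versions. A minimal sketch, assuming a node at localhost:9200 with the IK plugin installed (the host and the urllib-based helper are assumptions, not part of this thread):

```python
import json
import urllib.request

ES_HOST = "http://localhost:9200"  # assumed local node; adjust as needed


def analyze(text, analyzer="ik_max_word"):
    """POST /_analyze and return the token list (requires a running node)."""
    body = json.dumps({"analyzer": analyzer, "text": [text]}).encode("utf-8")
    req = urllib.request.Request(
        f"{ES_HOST}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["tokens"]


def offsets_monotonic(tokens):
    """True if no token's start_offset goes backwards in stream order."""
    last = 0
    for t in tokens:
        if t["start_offset"] < last:
            return False
        last = t["start_offset"]
    return True

# Per this thread: offsets_monotonic(analyze("贝尔法斯特")) holds on 8.10.2,
# while the 6.8.0 stream in the original report would return False.
```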

kin122 commented 1 month ago
(screenshot of the suspected IK source code omitted)

This might be a logic bug in that code; try commenting it out and recompiling.