How to resolve the tokenization differences between ik_smart and ik_max_word

whldoudou commented 1 year ago

Tokenization result with ik_max_word:

`GET _analyze { "text": ["52周"], "analyzer": "ik_max_word" }

{ "tokens" : [ { "token" : "52", "start_offset" : 0, "end_offset" : 2, "type" : "ARABIC", "position" : 0 }, { "token" : "周", "start_offset" : 2, "end_offset" : 3, "type" : "COUNT", "position" : 1 } ] } `

Tokenization result with ik_smart:

`GET _analyze { "text": ["52周"], "analyzer": "ik_smart" }

{ "tokens" : [ { "token" : "52周", "start_offset" : 0, "end_offset" : 3, "type" : "TYPE_CQUAN", "position" : 0 } ] } `

The problem: ik_max_word does not produce a token of type TYPE_CQUAN. Is there a way to solve this?

whldoudou commented 1 year ago

/**
 * Combine lexemes.
 */
private void compound(Lexeme result){

    // Merging only runs in smart mode (ik_smart); otherwise return immediately
    if(!this.cfg.isUseSmart()){
        return ;
    }
    // Merge numerals with measure words
    if(!this.results.isEmpty()){

        if(Lexeme.TYPE_ARABIC == result.getLexemeType()){
            Lexeme nextLexeme = this.results.peekFirst();
            boolean appendOk = false;
            if(Lexeme.TYPE_CNUM == nextLexeme.getLexemeType()){
                // Merge Arabic numeral + Chinese numeral
                appendOk = result.append(nextLexeme, Lexeme.TYPE_CNUM);
            }else if(Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()){
                // Merge Arabic numeral + Chinese measure word
                appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
            }
            if(appendOk){
                // Pop the merged lexeme
                this.results.pollFirst();
            }
        }

        // A second round of merging may be possible
        if(Lexeme.TYPE_CNUM == result.getLexemeType() && !this.results.isEmpty()){
            Lexeme nextLexeme = this.results.peekFirst();
            boolean appendOk = false;
            if(Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()){
                // Merge Chinese numeral + Chinese measure word
                appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
            }
            if(appendOk){
                // Pop the merged lexeme
                this.results.pollFirst();
            }
        }

    }

}

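The problem lies in the numeral and measure-word merging inside compound(). Why doesn't ik_max_word perform this merge? Is there a particular reason for that?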
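To make the effect of that early return concrete, here is a minimal, self-contained sketch that mimics the merge step on a toy token list. It is not IK's code: `Tok`, `CompoundSketch` and its `compound` method are hypothetical names introduced for illustration, and only the Arabic numeral + measure word case is modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for IK's Lexeme; only text and type are modeled.
final class Tok {
    final String text;
    final String type;

    Tok(String text, String type) {
        this.text = text;
        this.type = type;
    }

    @Override
    public String toString() {
        return text + "/" + type;
    }
}

final class CompoundSketch {
    // Mimics the guard in compound(): tokens are merged only when useSmart is true.
    static List<Tok> compound(List<Tok> in, boolean useSmart) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < in.size()) {
            Tok cur = in.get(i);
            boolean hasNext = i + 1 < in.size();
            if (useSmart && hasNext
                    && cur.type.equals("ARABIC")
                    && in.get(i + 1).type.equals("COUNT")) {
                // Arabic numeral + measure word -> one TYPE_CQUAN token
                out.add(new Tok(cur.text + in.get(i + 1).text, "TYPE_CQUAN"));
                i += 2;
            } else {
                out.add(cur);
                i += 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> raw = Arrays.asList(new Tok("52", "ARABIC"), new Tok("周", "COUNT"));
        System.out.println(compound(raw, false)); // useSmart off, as with ik_max_word
        System.out.println(compound(raw, true));  // useSmart on, as with ik_smart
    }
}
```

With `useSmart` off the two tokens pass through unchanged; with it on they collapse into a single `52周` token, which matches the two `_analyze` outputs above.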

whldoudou commented 1 year ago

@medcl

hongyan1110 commented 1 year ago

@medcl I ran into the same problem. In theory, the tokens produced by ik_smart should be a subset of those produced by ik_max_word.

crossmaya commented 11 months ago

I ran into this too and was just about to open an issue. Did you manage to solve it? It is exactly the same problem.

medcl commented 11 months ago

ik_smart uses a different algorithm, so its output is not necessarily a subset.

hongyan1110 commented 11 months ago

> ik_smart uses a different algorithm, so its output is not necessarily a subset.

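So if I use the ik_smart analyzer to search data that was indexed with ik_max_word, the search may not find it. But in ES, results are best when data is indexed with a fine-grained analyzer and searched with a coarse-grained one. Does IK offer a pair of analyzers that can be combined this way? @medcl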

kin122 commented 3 months ago

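> ik_smart uses a different algorithm, so its output is not necessarily a subset.
>
> So if I use the ik_smart analyzer to search data that was indexed with ik_max_word, the search may not find it. But in ES, results are best when data is indexed with a fine-grained analyzer and searched with a coarse-grained one. Does IK offer a pair of analyzers that can be combined this way? @medcl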

It is best to use the same analyzer for both indexing and searching.
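Why mixing the two analyzers can miss documents can be shown with a toy model of term matching. This is a minimal sketch, assuming a query finds a document when at least one query term equals an indexed term; the token lists are copied from the `_analyze` outputs at the top of the thread, and `TermMatchSketch` and `matches` are hypothetical names, not Elasticsearch APIs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TermMatchSketch {
    // Toy model: a document matches if any query term is among its indexed terms.
    static boolean matches(List<String> indexedTerms, List<String> queryTerms) {
        Set<String> index = new HashSet<>(indexedTerms);
        for (String q : queryTerms) {
            if (index.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> maxWord = Arrays.asList("52", "周"); // ik_max_word tokens for "52周"
        List<String> smart = Arrays.asList("52周");       // ik_smart tokens for "52周"

        // Indexed with ik_max_word, searched with ik_smart: no shared term
        System.out.println(matches(maxWord, smart));
        // Same analyzer on both sides: the terms line up
        System.out.println(matches(maxWord, maxWord));
    }
}
```

Under this simplified model the mixed setup finds nothing for `52周`, because the merged term `52周` never appears in the index, while using one analyzer on both sides does match.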

infinilabs / analysis-ik

How to resolve the tokenization differences between ik_smart and ik_max_word #992
