huaban / jieba-analysis

Jieba Chinese word segmentation (Java version)
https://github.com/huaban/jieba-analysis
Apache License 2.0
2.55k stars 835 forks

Duplicate words in segmentation results #107

Open li995495592 opened 4 years ago

li995495592 commented 4 years ago

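Some words can be segmented in two ways, and the result returns both segmentations one after the other. For example, "注意事项" is segmented into "注意, 事项, 注意事项": the text contains 注意事项 only once, but it appears twice in the result. Likewise, "校门口" is segmented into "校门, 门口, 校门口", so three characters of input become seven characters of output. Surely that is not right?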

zhaochuanzhen commented 4 years ago
// Optional: load a custom user dictionary from the classpath.
Path path = Paths.get(
        new File(getClass().getClassLoader().getResource("dicts/intent.dict").getPath()).getAbsolutePath());
WordDictionary.getInstance().loadUserDict(path);
JiebaSegmenter segmenter = new JiebaSegmenter();
// SEARCH mode is what removes the overlapping tokens reported above.
List<SegToken> result = segmenter.process(text, JiebaSegmenter.SegMode.SEARCH);

Just pass JiebaSegmenter.SegMode.SEARCH as the mode argument. The mode you are using now is probably JiebaSegmenter.SegMode.INDEX.
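
To make the difference between the two modes concrete, here is a toy model that reproduces the output reported in this issue. It is only a sketch and not the library's code: the class `SegModeSketch`, the method `expand`, and the six-word `DICT` are hypothetical names made up for illustration. It assumes that an INDEX-style mode emits the dictionary 2-grams and 3-grams found inside each longer word before the word itself, while a SEARCH-style mode emits the word alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class SegModeSketch {
    // Toy dictionary standing in for the real one (assumption for illustration).
    static final Set<String> DICT =
            Set.of("校门", "门口", "校门口", "注意", "事项", "注意事项");

    // Expand one already-segmented word.
    // indexMode = true : also emit dictionary 2-grams and 3-grams inside the word.
    // indexMode = false: emit the word once, with no overlapping sub-words.
    static List<String> expand(String word, boolean indexMode) {
        List<String> out = new ArrayList<>();
        if (indexMode) {
            for (int n = 2; n <= 3; n++) {
                if (word.length() > n) {
                    for (int i = 0; i + n <= word.length(); i++) {
                        String gram = word.substring(i, i + n);
                        if (DICT.contains(gram)) {
                            out.add(gram);
                        }
                    }
                }
            }
        }
        out.add(word);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("校门口", true));    // [校门, 门口, 校门口]
        System.out.println(expand("校门口", false));   // [校门口]
        System.out.println(expand("注意事项", true));  // [注意, 事项, 注意事项]
    }
}
```

The overlapping output is useful when building a search index, because a query for 门口 can then match a document containing 校门口. If each word should appear only once, as in the original question, the non-overlapping mode is the one to use.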