hankcs / HanLP

Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic textual similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified-traditional Chinese conversion, natural language processing
https://hanlp.hankcs.com/
Apache License 2.0

Fix ViterbiSegment not replacing the DoubleArrayTrie when loading a custom dictionary, which caused unexpected segmentation #1835

Closed wxy929629 closed 1 year ago

wxy929629 commented 1 year ago

Fix ViterbiSegment not replacing the DoubleArrayTrie when loading a custom dictionary, which caused unexpected segmentation

Description

When ViterbiSegment loads a custom dictionary it does not correctly replace the DoubleArrayTrie, so entries that should be segmented out as whole terms are not.
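
For context, here is a minimal sketch of the pattern the fix relies on. The class and field names below are hypothetical placeholders, not ViterbiSegment's actual internals (only DoubleArrayTrie and CoreDictionary.Attribute are real HanLP types); the point is that a freshly built trie must be assigned back to the segmenter rather than built and then discarded:

    import java.util.TreeMap;

    import com.hankcs.hanlp.collection.trie.DoubleArrayTrie;
    import com.hankcs.hanlp.dictionary.CoreDictionary;

    // Hypothetical stand-in for the segmenter's custom-dictionary state.
    class CustomTrieHolder
    {
        // Trie consulted during segmentation (illustrative field name).
        DoubleArrayTrie<CoreDictionary.Attribute> customTrie = new DoubleArrayTrie<CoreDictionary.Attribute>();

        void reloadCustomEntries(TreeMap<String, CoreDictionary.Attribute> entries)
        {
            DoubleArrayTrie<CoreDictionary.Attribute> freshTrie = new DoubleArrayTrie<CoreDictionary.Attribute>();
            freshTrie.build(entries);    // compile the custom entries into a double-array trie
            this.customTrie = freshTrie; // the crucial step: swap in the new trie so lookups actually see the entries
        }
    }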

Fixes # (issue)

Type of Change

Please check any relevant options and delete the rest.

How Has This Been Tested?

com/hankcs/hanlp/seg/SegmentTest.java

    public void testExtendViterbi() throws Exception
    {
        HanLP.Config.enableDebug(false);
        // Two custom dictionaries: the default CustomDictionary plus a gazetteer of place names (全国地名大全)
        String path = System.getProperty("user.dir") + "/" + "data/dictionary/custom/CustomDictionary.txt;" +
            System.getProperty("user.dir") + "/" + "data/dictionary/custom/全国地名大全.txt";
        path = path.replace("\\", "/");
        // Test sentence containing the place name 丁字桥镇
        String text = "一半天帕克斯曼是走不出丁字桥镇的";
        // Baseline segmenter with the custom dictionary disabled
        Segment segment = HanLP.newSegment().enableCustomDictionary(false);
        // ViterbiSegment constructed with the custom dictionary paths above
        Segment seg = new ViterbiSegment(path);
        System.out.println("不启用字典的分词结果:" + segment.seg(text)); // result without the custom dictionary
        System.out.println("默认分词结果:" + HanLP.segment(text)); // result of the default segmenter
        seg.enableCustomDictionaryForcing(true).enableCustomDictionary(true);
        List<Term> termList = seg.seg(text);
        System.out.println("自定义字典的分词结果:" + termList); // result with the custom dictionaries applied
    }

(Screenshot of the segmentation results printed by the test above.)
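
An assertion-based variant of the check above could look like the sketch below; it assumes 丁字桥镇 is listed in 全国地名大全.txt and that the method lives in the same JUnit TestCase as testExtendViterbi:

    public void testCustomPlaceNameKeptWhole() throws Exception
    {
        // Same custom dictionaries as in testExtendViterbi above
        String path = System.getProperty("user.dir") + "/data/dictionary/custom/CustomDictionary.txt;" +
            System.getProperty("user.dir") + "/data/dictionary/custom/全国地名大全.txt";
        Segment seg = new ViterbiSegment(path.replace("\\", "/"))
            .enableCustomDictionaryForcing(true).enableCustomDictionary(true);
        List<Term> termList = seg.seg("一半天帕克斯曼是走不出丁字桥镇的");
        boolean containsTown = false;
        for (Term term : termList)
        {
            // Assumed expectation: the place name should come out as a single term
            if ("丁字桥镇".equals(term.word)) containsTown = true;
        }
        assertTrue("丁字桥镇 should be segmented as one term with the custom dictionary loaded", containsTown);
    }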

Checklist

Check all items that apply.

hankcs commented 1 year ago

Thanks for the PR!