hankcs / HanLP

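Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing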
https://hanlp.hankcs.com/
Apache License 2.0
33.84k stars 10.12k forks

Self-trained pos.bin cannot be loaded #1751

Closed tianjiangtao closed 2 years ago

tianjiangtao commented 2 years ago

Describe the bug: The pos.bin file of a CRF model I trained myself cannot be loaded, but the pos.bin.txt file generated at the same time loads successfully.

Code to reproduce the issue

CRFPOSTagger tagger = new CRFPOSTagger(null); // create an empty tagger
// tagger = new CRFPOSTagger(PKU.POS_MODEL); // load
tagger = new CRFPOSTagger("/root/repo/hanlp-java/HanLP/data/test/pos.bin"); // load
System.out.println(Arrays.toString(tagger.tag("他", "的", "希望", "是", "希望", "上学"))); // predict
AbstractLexicalAnalyzer analyzer = new AbstractLexicalAnalyzer(new PerceptronSegmenter(), tagger); // build the lexical analyzer
System.out.println(analyzer.analyze("李狗蛋的希望是希望上学")); // word segmentation + POS tagging

The error is as follows:

java.lang.ArrayIndexOutOfBoundsException: 1677721600
at com.hankcs.hanlp.model.perceptron.feature.FeatureMap.loadTagSet(FeatureMap.java:99)
at com.hankcs.hanlp.model.perceptron.feature.ImmutableFeatureMDatMap.load(ImmutableFeatureMDatMap.java:92)
at com.hankcs.hanlp.model.perceptron.model.LinearModel.load(LinearModel.java:421)
at com.hankcs.hanlp.model.crf.LogLinearModel.load(LogLinearModel.java:58)
at com.hankcs.hanlp.model.perceptron.model.LinearModel.load(LinearModel.java:388)
at com.hankcs.hanlp.model.crf.LogLinearModel.<init>(LogLinearModel.java:83)
at com.hankcs.hanlp.model.crf.CRFTagger.<init>(CRFTagger.java:41)
at com.hankcs.hanlp.model.crf.CRFPOSTagger.<init>(CRFPOSTagger.java:45)
at com.hankcs.hanlp.model.crf.CRFPOSTaggerTest.testTrain(CRFPOSTaggerTest.java:25)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at junit.framework.TestCase.runTest(TestCase.java:176)
at junit.framework.TestCase.runBare(TestCase.java:141)
at junit.framework.TestResult$1.protect(TestResult.java:122)
at junit.framework.TestResult.runProtected(TestResult.java:142)
at junit.framework.TestResult.run(TestResult.java:125)
at junit.framework.TestCase.run(TestCase.java:129)
at junit.framework.TestSuite.runTest(TestSuite.java:255)
at junit.framework.TestSuite.run(TestSuite.java:250)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
at com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38)
at com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11)
at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35)
at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235)
at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54)
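
The index in the exception has a recognizable shape: 1677721600 is 0x64000000, which is the value 100 with its four bytes reversed. The sketch below checks that arithmetic using only the JDK. The class `IndexCheck` and the method `swapBytes` are made up for illustration and are not part of HanLP. This is an observation about the number, not a confirmed diagnosis. One possible reading is that the loader is decoding a file whose binary layout differs from what it expects, but the trace alone does not establish that.

```java
// Hypothetical helper, not part of HanLP: inspects the index reported in the exception.
class IndexCheck {
    // Reverse the byte order of a 32-bit integer.
    static int swapBytes(int value) {
        return Integer.reverseBytes(value);
    }

    public static void main(String[] args) {
        int reported = 1677721600; // index from the ArrayIndexOutOfBoundsException message
        System.out.println("hex: 0x" + Integer.toHexString(reported)); // prints "hex: 0x64000000"
        System.out.println("bytes reversed: " + swapBytes(reported));  // prints "bytes reversed: 100"
    }
}
```

A small, plausible count appearing only after a byte swap is a common sign that a binary file is being read with the wrong layout or from the wrong offset, which fits with the text file pos.bin.txt loading while pos.bin does not.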

Describe the current behavior: It fails 100% of the time.

Expected behavior: pos.bin loads normally.

tianjiangtao commented 2 years ago

However, using the same call to load /data/model/crf/pku199801/pos.txt.bin, which ships in data-for-1.7.zip, succeeds.

hankcs commented 2 years ago

See the FAQ: https://github.com/hankcs/HanLP/wiki/FAQ#%E4%B8%BA%E4%BB%80%E4%B9%88%E5%8A%A0%E8%BD%BD%E6%88%91%E8%87%AA%E5%B7%B1%E8%AE%AD%E7%BB%83%E7%9A%84crf%E6%A8%A1%E5%9E%8B%E5%A4%B1%E8%B4%A5%E4%BA%86