hankcs / HanLP

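Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing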
https://hanlp.hankcs.com/
Apache License 2.0
33.99k stars 10.18k forks

Spring Boot: loading data from a relative path throws an array index out-of-bounds exception #1788

Closed Sunywdev closed 2 years ago

Sunywdev commented 2 years ago

Describe the bug I am using portable-1.8.3 with Spring Boot. I changed root to a relative path, placed the data under resources/nlp, and wrote a custom IOAdapter. Calling NLPTokenizer fails with the following error:

Exception in thread "main" java.lang.ExceptionInInitializerError
    at com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer.<clinit>(AbstractLexicalAnalyzer.java:57)
    at com.hankcs.hanlp.tokenizer.NLPTokenizer.<clinit>(NLPTokenizer.java:39)
    at com.holly.top.springframework.config.TestNlp.main(TestNlp.java:21)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32621
    at com.hankcs.hanlp.utility.ByteUtil.bytesHighFirstToChar(ByteUtil.java:255)
    at com.hankcs.hanlp.corpus.io.ByteArray.nextChar(ByteArray.java:87)
    at com.hankcs.hanlp.dictionary.other.CharType.<clinit>(CharType.java:94)
    ... 3 more

Code to reproduce the issue

public static void main(String[] args) {
    List<Term> ff = NLPTokenizer.segment("我的名字叫邓欣雨");
    for (Term term : ff) {
        System.out.println(term);
    }
}

Describe the current behavior The IO adapter below is already enabled in hanlp.properties.

public class ResourceFileIoAdapter implements IIOAdapter {
    @Override
    public InputStream open(String path) throws IOException {
        ClassPathResource resource = new ClassPathResource(path);
        InputStream is = new FileInputStream(resource.getFile());
        return is;
    }

    @Override
    public OutputStream create(String path) throws IOException {
        ClassPathResource resource = new ClassPathResource(path);
        OutputStream os = new FileOutputStream(resource.getFile());
        return os;
    }
}

Expected behavior I would like to package data into the jar and still be able to use the NLP tokenizer.
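
Separately from the exception reported above, one general obstacle to that goal is that a resource packed inside a jar is not a file on disk, so an API that needs a `java.io.File` (such as the `resource.getFile()` calls in the adapter) generally cannot reach it, while a stream obtained from the class loader can. The following is a minimal sketch using only the JDK. `ClasspathStreams` and its `open` method are hypothetical names for illustration and are not part of HanLP or Spring; the example path assumes the data sits under resources/nlp.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of HanLP or Spring.
class ClasspathStreams {
    /** Opens a classpath resource as a stream, whether it lives in a directory or in a jar. */
    static InputStream open(String path) throws FileNotFoundException {
        // Class loader resource names do not start with '/'.
        String name = path.startsWith("/") ? path.substring(1) : path;
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        InputStream in = loader.getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException("Not on the classpath: " + name);
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        // Example path only (assumption); adjust it to your own layout.
        try (InputStream in = open("nlp/data/dictionary/other/CharType.bin")) {
            System.out.println("first byte: " + in.read());
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A read-only stream like this cannot be used for writing, so anything that needs to create or update files still requires a real file system location.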

System information

Other info / logs My configuration is as follows

root=nlp/
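#IO adapter: implement the com.hankcs.hanlp.corpus.io.IIOAdapter interface to run HanLP on other platforms (Hadoop, Redis, etc.)
#The default IO adapter is shown below; it is based on the ordinary file system.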
IOAdapter=com.holly.top.springframework.config.ResourceFileIoAdapter

File directory: image

hankcs commented 2 years ago

The Portable edition has an official adapter:

https://github.com/hankcs/HanLP/blob/6817646a344e77f27ae4649a3c222c91e061355b/src/main/java/com/hankcs/hanlp/corpus/io/ResourceIOAdapter.java#L19

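Judging from the screenshot above, root in your configuration file should be nlp. It is best to print com.hankcs.hanlp.HanLP.Config#CoreDictionaryPath to check whether the setting took effect.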
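
As a complement to printing the configured path, it can help to check whether the resolved resource name is actually visible on the classpath. The sketch below uses only the JDK. `ResourceCheck`, `resolve` and `describe` are hypothetical names, and joining root and the relative path with a single `/` is an assumption about how the final path is built, chosen to match the `nlp/data/...` paths that appear in this thread.

```java
import java.net.URL;

// Hypothetical diagnostic for illustration; not part of HanLP.
class ResourceCheck {
    /** Joins root and a relative path with exactly one '/' between them (assumed rule). */
    static String resolve(String root, String relativePath) {
        String r = root.replace('\\', '/');
        if (!r.isEmpty() && !r.endsWith("/")) {
            r += "/";
        }
        return r + relativePath;
    }

    /** Returns the resource URL as text, or "missing" if the class loader cannot see it. */
    static String describe(String resourceName) {
        URL url = ResourceCheck.class.getClassLoader().getResource(resourceName);
        return url == null ? "missing" : url.toString();
    }

    public static void main(String[] args) {
        String name = resolve("nlp", "data/dictionary/other/CharType.bin");
        System.out.println(name + " -> " + describe(name));
    }
}
```

A URL starting with `file:` means the resource is being read from a directory on disk, while one starting with `jar:` means it is being read from inside a jar.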

Sunywdev commented 2 years ago

@hankcs I have changed nlp/ to nlp. Here is the value of com.hankcs.hanlp.HanLP.Config#CoreDictionaryPath; the setting has taken effect. image

hankcs commented 2 years ago

Please use the official adapter.

Sunywdev commented 2 years ago

@hankcs I switched to the official adapter, configured as follows

root=nlp
IOAdapter=com.hankcs.hanlp.corpus.io.ResourceIOAdapter

It throws this exception:

Oct 11, 2022 12:04:47 PM com.hankcs.hanlp.corpus.io.IOUtil readBytes
WARNING: Exception while reading nlp/data/dictionary/other/CharType.bin: java.io.FileNotFoundException: nlp\data\dictionary\other\CharType.bin (The system cannot find the path specified.)
Exception in thread "main" java.lang.ExceptionInInitializerError
    at com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer.<clinit>(AbstractLexicalAnalyzer.java:57)
    at com.hankcs.hanlp.tokenizer.NLPTokenizer.<clinit>(NLPTokenizer.java:39)
    at com.holly.top.springframework.config.TestNlp.main(TestNlp.java:21)
Caused by: java.lang.IllegalArgumentException: Character type table nlp/data/dictionary/other/CharType.bin failed to load: java.io.FileNotFoundException: nlp\data\dictionary\other\CharType.bin (The system cannot find the path specified.)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
    at com.hankcs.hanlp.corpus.io.ResourceIOAdapter.create(ResourceIOAdapter.java:31)
    at com.hankcs.hanlp.corpus.io.IOUtil.newOutputStream(IOUtil.java:697)
    at com.hankcs.hanlp.dictionary.other.CharType.generate(CharType.java:134)
    at com.hankcs.hanlp.dictionary.other.CharType.<clinit>(CharType.java:85)
    at com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer.<clinit>(AbstractLexicalAnalyzer.java:57)
    at com.hankcs.hanlp.tokenizer.NLPTokenizer.<clinit>(NLPTokenizer.java:39)
    at com.holly.top.springframework.config.TestNlp.main(TestNlp.java:21)

    at com.hankcs.hanlp.dictionary.other.CharType.<clinit>(CharType.java:89)
    ... 3 more

image

hankcs commented 2 years ago

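java.io.FileNotFoundException: nlp\data\dictionary\other\CharType.bin (The system cannot find the path specified.) means this file was not packaged into the jar. That has nothing to do with HanLP; please check it yourself.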

The most foolproof approach is to open hanlp-portable-1.8.3.jar with a zip tool and replace its data folder with the complete one.


Sunywdev commented 2 years ago

@hankcs I checked, and the file really does exist in the jar 😵 image

hankcs commented 2 years ago

In your jar, nlp is not at the root. Shouldn't it sit at the same level as META-INF?
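
One way to confirm where the file actually sits inside a jar is to list the matching entry names. The sketch below uses only the JDK's `java.util.jar.JarFile`; `JarEntryLocator` and `find` are hypothetical names, and the suffix passed in `main` is just the path from the log above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical diagnostic for illustration; not part of HanLP.
class JarEntryLocator {
    /** Returns the full names of all jar entries whose name ends with the given suffix. */
    static List<String> find(String jarPath, String suffix) throws IOException {
        List<String> matches = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.endsWith(suffix)) {
                    matches.add(name);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: JarEntryLocator <path-to-jar>");
            return;
        }
        for (String name : find(args[0], "data/dictionary/other/CharType.bin")) {
            System.out.println(name);
        }
    }
}
```

If the printed name is exactly `nlp/data/dictionary/other/CharType.bin`, the nlp folder sits at the jar root next to META-INF. Any extra leading directory means the resource name seen by the class loader differs from what root=nlp would produce.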