hankcs / HanLP

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
https://hanlp.hankcs.com/en/
Apache License 2.0

Loading a word vector model with WordVectorModel(model_file) in test_word2vec.py runs out of memory #1013

Closed achenjie closed 6 years ago

achenjie commented 6 years ago

Notes

Please confirm the items below:

Version

The latest released version is: hanlp-1.6.8. The version I am using is: hanlp-1.6.8

My question

from pyhanlp import *  # starts the JVM and exposes JClass
WordVectorModel = JClass('com.hankcs.hanlp.mining.word2vec.WordVectorModel')
model_file = 'user/data/baidubaike.txt'
word2vec = WordVectorModel(model_file)  # loads the whole model into the JVM heap

The baidubaike.txt file is about 4 GB; running the code above produces the following error:

File "D:\Anaconda3\lib\site-packages\jpype\_jclass.py", line 111, in _javaInit
java.lang.OutOfMemoryErrorPyRaisable: java.lang.OutOfMemoryError: GC overhead limit exceeded

Increasing the JVM memory did not solve the problem. How can this be handled when the file is this large?
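If raising the JVM heap is not enough, one workaround on the Python side (outside HanLP's API, so treat this as a sketch) is to avoid materializing the full 4 GB model and instead stream the text file, keeping only the vectors you actually need. This assumes the standard word2vec text format: an optional `vocab_size dim` header line, then one `word v1 v2 …` line per word.

```python
def load_vectors_subset(path, wanted):
    """Stream a word2vec text file, keeping only words in `wanted`.

    Each line is parsed and immediately discarded unless its word is in
    the requested set, so memory use is proportional to len(wanted),
    not to the file size.
    """
    wanted = set(wanted)
    vectors = {}
    with open(path, encoding='utf-8') as f:
        first = f.readline().split()
        if len(first) == 2 and all(t.isdigit() for t in first):
            pass  # "vocab_size dim" header line: skip it
        elif first and first[0] in wanted:
            vectors[first[0]] = [float(v) for v in first[1:]]
        for line in f:
            parts = line.rstrip('\n').split(' ')
            if parts[0] in wanted:
                vectors[parts[0]] = [float(v) for v in parts[1:]]
            if len(vectors) == len(wanted):
                break  # stop early once every requested word was found
    return vectors
```

Similarity queries can then be done with a plain dot product over this dict; the trade-off is that you must know the vocabulary of interest up front.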

batizhao commented 5 years ago

Piggybacking with a related question: the sgns.baidubaike.bigram-char vector model is about 2 GB, and `new WordVectorModel` takes roughly a minute. Is there a way to make it faster?
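Most of that minute goes to parsing floats from text. A common workaround (this is not part of HanLP's API; the helper names and file layout here are my own) is to parse the text model once, cache it as a binary matrix plus a vocabulary list, and memory-map the matrix on every later load, which is near-instant:

```python
import numpy as np

def cache_vectors(txt_path, npy_path, vocab_path):
    """One-time conversion: word2vec text format -> .npy matrix + vocab file."""
    words, rows = [], []
    with open(txt_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split(' ')
            if len(parts) == 2 and parts[0].isdigit():
                continue  # skip the "vocab_size dim" header if present
            words.append(parts[0])
            rows.append(np.asarray(parts[1:], dtype=np.float32))
    np.save(npy_path, np.vstack(rows))
    with open(vocab_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(words))

def load_cached(npy_path, vocab_path):
    """Fast reload: mmap_mode='r' maps the matrix lazily from disk."""
    matrix = np.load(npy_path, mmap_mode='r')
    with open(vocab_path, encoding='utf-8') as f:
        vocab = f.read().split('\n')
    return vocab, matrix
```

With the memory-mapped matrix, pages are pulled from disk only when touched, so startup cost no longer scales with model size; a word's vector is `matrix[vocab.index(word)]` (or via a precomputed word-to-index dict for speed).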