hankcs / HanLP

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
https://hanlp.hankcs.com/en/
Apache License 2.0

Loading a word vector model with WordVectorModel(model_file) in test_word2vec.py runs out of memory #1013

Closed achenjie closed 6 years ago

achenjie commented 6 years ago

Notes

Please confirm the items below:

Version

The latest released version is: hanlp-1.6.8. The version I am using is: hanlp-1.6.8

My question

from pyhanlp import *  # starts the JVM and exposes JClass
WordVectorModel = JClass('com.hankcs.hanlp.mining.word2vec.WordVectorModel')
model_file = 'user/data/baidubaike.txt'
word2vec = WordVectorModel(model_file)  # loads the whole model into the JVM heap

The baidubaike.txt file is about 4 GB; running the code above produces the following error:

File "D:\Anaconda3\lib\site-packages\jpype\_jclass.py", line 111, in _javaInit
java.lang.OutOfMemoryErrorPyRaisable: java.lang.OutOfMemoryError: GC overhead limit exceeded

Increasing the JVM memory did not solve the problem. How can this be handled when the file is this large?
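If raising the JVM heap is not enough, one workaround on the Python side (outside HanLP's API, so treat this as a sketch) is to avoid materializing the full 4 GB model and instead stream the text file, keeping only the vectors you actually need. This assumes the standard word2vec text format: an optional `vocab_size dim` header line, then one `word v1 v2 …` line per word.

```python
def load_vectors_subset(path, wanted):
    """Stream a word2vec text file, keeping only words in `wanted`.

    Each line is parsed and immediately discarded unless its word is in
    the requested set, so memory use is proportional to len(wanted),
    not to the file size.
    """
    wanted = set(wanted)
    vectors = {}
    with open(path, encoding='utf-8') as f:
        first = f.readline().split()
        if len(first) == 2 and all(t.isdigit() for t in first):
            pass  # "vocab_size dim" header line: skip it
        elif first and first[0] in wanted:
            vectors[first[0]] = [float(v) for v in first[1:]]
        for line in f:
            parts = line.rstrip('\n').split(' ')
            if parts[0] in wanted:
                vectors[parts[0]] = [float(v) for v in parts[1:]]
            if len(vectors) == len(wanted):
                break  # stop early once every requested word was found
    return vectors
```

Similarity queries can then be done with a plain dot product over this dict; the trade-off is that you must know the vocabulary of interest up front.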

batizhao commented 5 years ago

Piggybacking with a related question: the sgns.baidubaike.bigram-char vector model is about 2 GB, and `new WordVectorModel` takes roughly a minute. Is there a way to make it faster?
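Most of that minute goes to parsing floats from text. A common workaround (this is not part of HanLP's API; the helper names and file layout here are my own) is to parse the text model once, cache it as a binary matrix plus a vocabulary list, and memory-map the matrix on every later load, which is near-instant:

```python
import numpy as np

def cache_vectors(txt_path, npy_path, vocab_path):
    """One-time conversion: word2vec text format -> .npy matrix + vocab file."""
    words, rows = [], []
    with open(txt_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split(' ')
            if len(parts) == 2 and parts[0].isdigit():
                continue  # skip the "vocab_size dim" header if present
            words.append(parts[0])
            rows.append(np.asarray(parts[1:], dtype=np.float32))
    np.save(npy_path, np.vstack(rows))
    with open(vocab_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(words))

def load_cached(npy_path, vocab_path):
    """Fast reload: mmap_mode='r' maps the matrix lazily from disk."""
    matrix = np.load(npy_path, mmap_mode='r')
    with open(vocab_path, encoding='utf-8') as f:
        vocab = f.read().split('\n')
    return vocab, matrix
```

With the memory-mapped matrix, pages are pulled from disk only when touched, so startup cost no longer scales with model size; a word's vector is `matrix[vocab.index(word)]` (or via a precomputed word-to-index dict for speed).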