iunderstand / SWE

SWE Toolkit. Learning Semantic Word Embeddings based on Ordinal Knowledge Constraints. A general framework to incorporate semantic knowledge into the popular data-driven learning process of word vectors. Applications including word similarity, sentence completion, etc. ACL-2015, Beijing, China
Apache License 2.0

Bug in SWE_Train.c? Segmentation fault #1

Closed chenbjin closed 7 years ago

chenbjin commented 7 years ago

Addendum: training word vectors with plain word2vec works fine.

Hello, how should I train the word vectors? Suppose I want to train SWE+Synon-Anton. I tried the following steps:

Machine: Ubuntu 14.04, 128 GB RAM

  1. Provide a Wikipedia corpus, train.txt
  2. Split semantics/SWE.EN.KnowDB.WordNet-Book.Synon-Anton into sem.train.txt and sem.valid.txt
  3. Run ./SWE_Train -train train.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 0 -iter 3 -sem-train sem.train.txt -sem-valid sem.valid.txt -sem-coeff 0.1 -sem-hinge 0.0 -sem-addtime 0 -weight-decay 0 -delta-left 1 -delta-right 1

Problem: after the corpus is read, the program crashes with a Segmentation fault.

Also, what ratio should be used to split the WordNet data into train and valid? The paper does not mention it.

The log is as follows:

Semantic Word Embedding (SWE) Toolkit
Train Setting embedding size: 200
Train Setting window size: 5
Train Setting sample value: 0.000100
Train Setting negative num: 5
Running Threads: 12
Iteration Times: 3
SemWE Qsem train file: ../semantics/SWE.EN.KnowDB.WordNet-Book.Synon-Anton.train
SemWE Qsem valid file: ../semantics/SWE.EN.KnowDB.WordNet-Book.Synon-Anton.valid
SemWE Add Time(/%): 0.000000
SemWE Weight Decay: 0.000000
SemWE Inter Coeff: 0.100000
SemWE Norm Hinge Margin: 0.000000
SemWE Inequation Delta Left: 1
SemWE Inequation Delta Right: 1
Training Starting @Time: Fri Feb 24 19:21:08 2017

Starting training using file wikicorpus.1b
Vocab size: 218317
Words in train file: 123353508
Load Training Word Knowledge from file ../semantics/SWE.EN.KnowDB.WordNet-Book.Synon-Anton.train
--- InEquation Nums: 424732
--- Finish reading the Knowledge Database
Load CV Test Word Knowledge from file ../semantics/SWE.EN.KnowDB.WordNet-Book.Synon-Anton.valid
--- CV set InEquation Nums: 1000
./run.sh: line 5: 25479 Segmentation fault (core dumped) ./SWE_Train -train ${TRAIN_FILE} -output vec.bin -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 1 -cbow 0 -iter 3 -sem-train ${SEW_FILE} -sem-valid ${SEW_CV_FILE} -sem-coeff 0.1 -sem-hinge 0.0 -sem-addtime 0 -weight-decay 0 -delta-left 1 -delta-right 1

iunderstand commented 7 years ago

Hello,

You need to make sure that every word in the semantic constraint set also appears in the corpus vocabulary.

Liu Quan
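
The check described above could be done before training with a small preprocessing script. The following is only a rough sketch under assumptions that are not confirmed in this thread: it treats every whitespace-separated token on a constraint line as a word (the real SWE.EN.KnowDB files may use a different layout, so adapt the parsing), and it rebuilds the vocabulary with a minimum count of 5, which is word2vec's default; whether SWE_Train applies the same threshold is an assumption. The function names `build_vocab` and `filter_constraints` are hypothetical helpers, not part of the SWE toolkit.

```python
from collections import Counter


def build_vocab(corpus_path, min_count=5):
    """Return the set of tokens that occur at least min_count times in the corpus.

    min_count=5 mirrors word2vec's default; it is an assumption that
    SWE_Train filters rare words the same way.
    """
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return {word for word, count in counts.items() if count >= min_count}


def filter_constraints(constraint_path, vocab, out_path):
    """Copy only the constraint lines whose tokens are all in vocab.

    Assumes every whitespace-separated token on a line is a word.
    Blank lines are skipped. Returns (kept, dropped) line counts.
    """
    kept = dropped = 0
    with open(constraint_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            words = line.split()
            if not words:
                continue
            if all(word in vocab for word in words):
                dst.write(line.rstrip("\n") + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped


# Example usage with the file names from this thread (output name is hypothetical):
#   vocab = build_vocab("train.txt")
#   kept, dropped = filter_constraints("sem.train.txt", vocab, "sem.train.filtered.txt")
#   print(kept, "constraints kept,", dropped, "dropped")
```

If the dropped count is non-zero, some constraints refer to words outside the corpus vocabulary, which matches the cause suggested in the reply above.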

chenbjin commented 7 years ago

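Thank you very much! One more question: what ratio should be used to split the semantic constraint data into train and valid? I would like to run comparison experiments with this model at its best setting.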

iunderstand commented 7 years ago

Typically, 5% to 20% of the data is held out as the development set.

Liu Quan
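
A split in that range could be produced with a few lines of Python. This is a minimal sketch, not the procedure used in the paper: it assumes one constraint per line, shuffles with a fixed seed for reproducibility, and holds out 10% by default (any value between 0.05 and 0.20 fits the advice above). The function name `split_constraints` and its parameters are hypothetical.

```python
import random


def split_constraints(path, train_path, valid_path, valid_fraction=0.1, seed=0):
    """Shuffle the constraint lines and write a train/valid split.

    Assumes one constraint per line; blank lines are ignored.
    Returns (number of train lines, number of valid lines).
    """
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]

    # A fixed seed makes the split reproducible across runs.
    random.Random(seed).shuffle(lines)

    n_valid = int(len(lines) * valid_fraction)
    valid, train = lines[:n_valid], lines[n_valid:]

    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(train)
    with open(valid_path, "w", encoding="utf-8") as f:
        f.writelines(valid)
    return len(train), len(valid)


# Example usage with the file names from this thread:
#   split_constraints("semantics/SWE.EN.KnowDB.WordNet-Book.Synon-Anton",
#                     "sem.train.txt", "sem.valid.txt", valid_fraction=0.1)
```

Splitting after removing out-of-vocabulary constraints keeps both files consistent with the corpus vocabulary.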

chenbjin commented 7 years ago

OK, thanks a lot for the guidance!