Closed BenjaminHess closed 8 years ago
Are you able to send the text files, or send the stack trace from running the following?
$ gdb -ex=r --args build/glove -save-file vectors -input-file cooccurrence.shuf.bin -vector-size 2 -vocab-file vocab.txt -binary 0
... Segfault
(gdb) bt
babi.txt vocab.txt vectors.txt
Let me know if you specifically need gdb. Here is the lldb output:
lldb -- ../../glove/build/glove -save-file vectors -input-file cooccurrence.shuf.bin -vector-size 2 -vocab-file vocab.txt -binary 0
(lldb) target create "../../glove/build/glove"
Current executable set to '../../glove/build/glove' (x86_64).
(lldb) settings set -- target.run-args "-save-file" "vectors" "-input-file" "cooccurrence.shuf.bin" "-vector-size" "2" "-vocab-file"
"vocab.txt" "-binary" "0"
(lldb) r
r
Process 3283 launched: '../../glove/build/glove' (x86_64)
TRAINING MODEL
Read 1090 lines.
Initializing parameters...done.
vector size: 2
vocab size: 33
x_max: 100.000000
alpha: 0.750000
iter: 001, cost: nan
iter: 002, cost: nan
iter: 003, cost: nan
iter: 004, cost: nan
iter: 005, cost: nan
iter: 006, cost: nan
iter: 007, cost: nan
iter: 008, cost: nan
iter: 009, cost: nan
iter: 010, cost: nan
iter: 011, cost: nan
iter: 012, cost: nan
iter: 013, cost: nan
iter: 014, cost: nan
iter: 015, cost: nan
iter: 016, cost: nan
iter: 017, cost: nan
Process 3283 stopped
* thread #2: tid = 0x62f0, 0x0000000100000f10 glove`glove_thread + 512, stop reason = EXC_BAD_ACCESS (code=1, address=0xffc01500)
frame #0: 0x0000000100000f10 glove`glove_thread + 512
glove`glove_thread:
-> 0x100000f10 <+512>: vmovsd (%rdx,%rcx,8), %xmm1
0x100000f15 <+517>: vmulsd (%rax,%rcx,8), %xmm1, %xmm1
0x100000f1a <+522>: vaddsd %xmm0, %xmm1, %xmm0
0x100000f1e <+526>: incq %rcx
(lldb)
Okay, @BenjaminHess, thank you for the reproducible bug. It should be addressed by https://github.com/stanfordnlp/GloVe/commit/eb4714ec94eefc83f84d7fa32350b6ea6b5c3e8a. The issue is that the tokenization of words -> indices for the bAbI corpus is failing in some locations, for a reason I have yet to identify (perhaps the existence of a blank '' word). In any case, skipping word pairs where one word has index == 0 seems to fix the problem at hand for the time being. Word indices are assumed to start at 1, so the 0-index words were resulting in out-of-bounds memory access.
I have a small corpus (165,367 words, 33 unique) from bAbI. After successfully creating the ancillary files, the following command produces nan vectors or Segmentation fault: 11:
./build/glove -save-file vectors -input-file cooccurrence.shuf.bin -vector-size 2 -vocab-file vocab.txt -binary 0
From vectors.txt: