stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

nan output: pthreads or corpus size? #14

Closed: BenjaminHess closed this issue 8 years ago

BenjaminHess commented 8 years ago

I have a small corpus (165,367 words, 33 unique) from bAbI. After successfully creating the ancillary files, the following command produces nan vectors or Segmentation fault: 11:

build/glove -save-file vectors -input-file cooccurrence.shuf.bin -vector-size 2 -vocab-file vocab.txt -binary 0
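For context, the ancillary files were presumably built with the standard pipeline from the repo's demo.sh; the flags below are illustrative (with -min-count lowered for this tiny corpus), not necessarily the exact commands used:

build/vocab_count -min-count 1 -verbose 2 < babi.txt > vocab.txt
build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 15 < babi.txt > cooccurrence.bin
build/shuffle -memory 4.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin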

From vectors.txt:

down nan nan
put nan nan
<unk> nan nan
ghost commented 8 years ago

Are you able to send the text files, or send the stack trace from running the following?

$ gdb -ex=r --args build/glove -save-file vectors -input-file cooccurrence.shuf.bin -vector-size 2 -vocab-file vocab.txt -binary 0
... Segfault
(gdb) bt
BenjaminHess commented 8 years ago

babi.txt vocab.txt vectors.txt

Let me know if you specifically need gdb. Here is the lldb output:

lldb -- ../../glove/build/glove -save-file vectors -input-file cooccurrence.shuf.bin -vector-size 2 -vocab-file vocab.txt -binary 0
(lldb) target create "../../glove/build/glove"
Current executable set to '../../glove/build/glove' (x86_64).
(lldb) settings set -- target.run-args  "-save-file" "vectors" "-input-file" "cooccurrence.shuf.bin" "-vector-size" "2" "-vocab-file"
"vocab.txt" "-binary" "0"
(lldb) r
r
Process 3283 launched: '../../glove/build/glove' (x86_64)
TRAINING MODEL
Read 1090 lines.
Initializing parameters...done.
vector size: 2
vocab size: 33
x_max: 100.000000
alpha: 0.750000
iter: 001, cost: nan
iter: 002, cost: nan
iter: 003, cost: nan
iter: 004, cost: nan
iter: 005, cost: nan
iter: 006, cost: nan
iter: 007, cost: nan
iter: 008, cost: nan
iter: 009, cost: nan
iter: 010, cost: nan
iter: 011, cost: nan
iter: 012, cost: nan
iter: 013, cost: nan
iter: 014, cost: nan
iter: 015, cost: nan
iter: 016, cost: nan
iter: 017, cost: nan
Process 3283 stopped
* thread #2: tid = 0x62f0, 0x0000000100000f10 glove`glove_thread + 512, stop reason = EXC_BAD_ACCESS (code=1, address=0xffc01500)
    frame #0: 0x0000000100000f10 glove`glove_thread + 512
glove`glove_thread:
->  0x100000f10 <+512>: vmovsd (%rdx,%rcx,8), %xmm1
    0x100000f15 <+517>: vmulsd (%rax,%rcx,8), %xmm1, %xmm1
    0x100000f1a <+522>: vaddsd %xmm0, %xmm1, %xmm0
    0x100000f1e <+526>: incq   %rcx
(lldb)
ghost commented 8 years ago

Okay, @BenjaminHess, thank you for the reproducible bug. It should be addressed by https://github.com/stanfordnlp/GloVe/commit/eb4714ec94eefc83f84d7fa32350b6ea6b5c3e8a. The issue is that the word -> index tokenization for the bAbI corpus is failing in some places, for a reason I have yet to identify (perhaps the existence of a blank '' word). Anyway, skipping word pairs where either word has index == 0 seems to fix the problem at hand for the time being. Word indices are assumed to start at 1, and the 0-index words were resulting in out-of-bounds memory access.
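A minimal sketch of that guard, assuming the CREC record layout from glove.c (the exact patch may differ; see the commit above):

#include <stdio.h>

/* Cooccurrence record layout used by glove.c; word indices are 1-based. */
typedef struct cooccur_rec {
    int word1;   /* index of the focus word */
    int word2;   /* index of the context word */
    double val;  /* cooccurrence count */
} CREC;

/* Read the next usable record, skipping any pair with an invalid 0 index.
 * Offsets into the parameter table are computed as (word - 1) * stride,
 * so a 0 index would address memory before the start of the array. */
static int read_valid_record(FILE *fin, CREC *cr) {
    for (;;) {
        if (fread(cr, sizeof(CREC), 1, fin) != 1) return 0; /* EOF or read error */
        if (cr->word1 >= 1 && cr->word2 >= 1) return 1;     /* usable pair */
        /* otherwise: bad tokenization produced a 0 index; drop this pair */
    }
}

With such a guard in place, pairs emitted by the broken tokenization are simply dropped rather than indexing before the start of the parameter table, which would account for both symptoms above: garbage reads producing nan costs when the stray address happens to be mapped, and a segfault when it is not.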