stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Memory error #92

Closed: budhiraja closed this issue 4 years ago

budhiraja commented 7 years ago

I am trying to train GloVe on a high-end server with a medium-sized corpus, but I am getting memory errors. The odd part is that the error appears only after the first epoch completes. Here is the generated log:

BUILDING VOCABULARY
Processed 62983105 tokens.
Counted 2515374 unique words.
Truncating vocabulary at min count 2.
Using vocabulary of size 817196.

COUNTING COOCCURRENCES
window size: 5
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "/lustre/amar/glove_models/vocab.txt"...loaded 817196 words.
Building lookup table...table contains 161333419 elements.
Processed 62981822 tokens.
Writing cooccurrences to disk...........3 files in total.
Merging cooccurrence files: processed 66886759 lines.

SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 66886759 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 66886759 lines.

TRAINING MODEL
Read 66886759 lines.
Initializing parameters...done.
vector size: 50
vocab size: 817196
x_max: 10.000000
alpha: 0.750000
iter: 001, cost: 0.108257
glibc detected build/glove: double free or corruption (!prev): 0x00000000007e4540 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3058875366]
/lib64/libc.so.6[0x3058877e93]
/lib64/libc.so.6(fclose+0x14d)[0x3058865add]
build/glove[0x4021b7]
build/glove[0x4025f0]
build/glove[0x402d34]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x305881ecdd]
build/glove[0x400db9]
======= Memory map: ========
00400000-00404000 r-xp 00000000 00:15 1653944383 /home/iiit/amar/Bug2Vec/glove/build/glove
00603000-00604000 rw-p 00003000 00:15 1653944383 /home/iiit/amar/Bug2Vec/glove/build/glove
007e3000-00804000 rw-p 00000000 00:00 0 [heap]
3058000000-3058020000 r-xp 00000000 fd:00 783778 /lib64/ld-2.12.so
305821f000-3058220000 r--p 0001f000 fd:00 783778 /lib64/ld-2.12.so
3058220000-3058221000 rw-p 00020000 fd:00 783778 /lib64/ld-2.12.so
3058221000-3058222000 rw-p 00000000 00:00 0
3058800000-3058989000 r-xp 00000000 fd:00 783779 /lib64/libc-2.12.so
3058989000-3058b88000 ---p 00189000 fd:00 783779 /lib64/libc-2.12.so
3058b88000-3058b8c000 r--p 00188000 fd:00 783779 /lib64/libc-2.12.so
3058b8c000-3058b8d000 rw-p 0018c000 fd:00 783779 /lib64/libc-2.12.so
3058b8d000-3058b92000 rw-p 00000000 00:00 0
3058c00000-3058c17000 r-xp 00000000 fd:00 783781 /lib64/libpthread-2.12.so
3058c17000-3058e17000 ---p 00017000 fd:00 783781 /lib64/libpthread-2.12.so
3058e17000-3058e18000 r--p 00017000 fd:00 783781 /lib64/libpthread-2.12.so
3058e18000-3058e19000 rw-p 00018000 fd:00 783781 /lib64/libpthread-2.12.so
3058e19000-3058e1d000 rw-p 00000000 00:00 0
3059400000-3059483000 r-xp 00000000 fd:00 783785 /lib64/libm-2.12.so
3059483000-3059682000 ---p 00083000 fd:00 783785 /lib64/libm-2.12.so
3059682000-3059683000 r--p 00082000 fd:00 783785 /lib64/libm-2.12.so
3059683000-3059684000 rw-p 00083000 fd:00 783785 /lib64/libm-2.12.so
305a800000-305a816000 r-xp 00000000 fd:00 783788 /lib64/libgcc_s-4.4.6-20120305.so.1
305a816000-305aa15000 ---p 00016000 fd:00 783788 /lib64/libgcc_s-4.4.6-20120305.so.1
305aa15000-305aa16000 rw-p 00015000 fd:00 783788 /lib64/libgcc_s-4.4.6-20120305.so.1
7fc048000000-7fc048021000 rw-p 00000000 00:00 0
7fc048021000-7fc04c000000 ---p 00000000 00:00 0
7fc04e099000-7fc04e09a000 ---p 00000000 00:00 0
7fc04e09a000-7fc09e280000 rw-p 00000000 00:00 0
7fc09e298000-7fc09e29a000 rw-p 00000000 00:00 0
7fff01705000-7fff0171a000 rw-p 00000000 00:00 0 [stack]
7fff017ff000-7fff01800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
./demo.sh: line 43: 1678 Aborted (core dumped) $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
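
The backtrace in this log passes through fclose just before glibc aborts. That is consistent with a stream being closed twice, although heap corruption elsewhere can also surface at that point, and the log alone does not identify the faulty line in GloVe. As a minimal sketch of one defensive pattern, the C code below sets the FILE pointer to NULL after closing it, so a repeated close becomes a no-op. The helpers `close_once` and `save_values` are hypothetical names used for illustration and are not part of the GloVe source.

```c
#include <stdio.h>

/* Hypothetical helper (not from GloVe): closes *fp at most once.
 * Returns 0 on success or if the stream was already closed,
 * and EOF if fclose itself reports an error. */
static int close_once(FILE **fp) {
    int rc = 0;
    if (fp != NULL && *fp != NULL) {
        rc = fclose(*fp);
        *fp = NULL; /* a later call sees NULL and does nothing */
    }
    return rc;
}

/* Hypothetical save routine (not from GloVe): writes n values to
 * path, one per line. Returns 0 on success and -1 on any failure. */
static int save_values(const char *path, const double *values, int n) {
    FILE *fout = fopen(path, "w");
    int status = 0;
    int i;

    if (fout == NULL) {
        return -1;
    }
    for (i = 0; i < n; i++) {
        if (fprintf(fout, "%f\n", values[i]) < 0) {
            status = -1;
            break;
        }
    }
    if (close_once(&fout) != 0) {
        status = -1;
    }
    /* A second close on the same variable is now harmless, whereas
     * calling fclose(fout) twice would be undefined behavior. */
    close_once(&fout);
    return status;
}
```

Calling `save_values("vectors.txt", data, n)` exercises both close calls. To find the real fault, a build with debug symbols run under a memory checker would likely be more informative than this guard.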

phdowling commented 6 years ago

I get the same error, but only after all 100 training iterations have completed and the model starts saving the vectors. Were you able to fix this?

AngledLuffa commented 4 years ago

We had some pull requests improving memory management recently. If anyone comes across a similar problem again, please reopen this bug or open a new one.