stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

demo.sh fails at shuffling with segmentation fault #6

Closed hrzafer closed 8 years ago

hrzafer commented 8 years ago

The demo.sh fails on a virtual Linux Mint 17.1 (based on Ubuntu 14.04) on VirtualBox (with 4GB ram).

...
Merging cooccurrence files: processed 60666466 lines.

SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 0 lines../demo.sh: line 55:  2613 Segmentation fault   
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
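A rough capacity check suggests why this step dies: shuffle tries to hold the whole array in RAM at once. Assuming each cooccurrence record is a 16-byte struct (two 4-byte word indices plus an 8-byte double; the exact layout is an assumption here), the reported array size works out to:

```shell
# Estimate the RAM the shuffle array needs
# (16 bytes per record is an assumption about the struct layout).
array_size=255013683   # value printed by shuffle in the log above
awk -v n="$array_size" 'BEGIN { printf "%.2f GiB\n", n * 16 / 2^30 }'
# ~3.80 GiB for the array alone, leaving little headroom in a 4 GB guest
```

With the OS and VirtualBox overhead on top of that, a 4 GB VM plausibly cannot satisfy the allocation.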
ghost commented 8 years ago

Thanks for reporting these. Will try to reproduce.

hrzafer commented 8 years ago

I also get this error on a virtual Xubuntu 14.04, in addition to the compiler warnings. Linux Mint 17.1 on VirtualBox compiles with no warnings.

FYI, I'm actually a Windows user, and thankfully I managed to compile and run GloVe with Cygwin (on Windows 10) without any problems.

ghost commented 8 years ago

Sorry for the slow reply. Meant to respond to this earlier after merging https://github.com/stanfordnlp/GloVe/pull/3. Do you mind trying this again and seeing if you still get a segmentation fault after the fix? Perhaps if you do, you could try running with valgrind --tool=memcheck $BUILDDIR/shuffle ... and paste the result?

ghost commented 8 years ago

I'm going to assume this was solved by our memory pull request fixes. Please reopen if you are able to reproduce on the most recent version.

hrzafer commented 8 years ago

Sorry for the late feedback. I use VirtualBox 5.0.10, Windows 10 as host, Linux Mint 17.1 (2 GB RAM, gcc version 4.8) as guest. I just downloaded the latest version as a zip and extracted it. This is my terminal output:

harun@harun-mint ~/Desktop/GloVe-master $ make
mkdir -p build
gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
harun@harun-mint ~/Desktop/GloVe-master $ ./demo.sh 
mkdir -p build
gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
--2015-11-22 14:24:24--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 98.139.135.129
Connecting to mattmahoney.net (mattmahoney.net)|98.139.135.129|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’

100%[======================================>] 31.344.016   678KB/s   in 60s    

2015-11-22 14:25:25 (508 KB/s) - ‘text8.zip’ saved [31344016/31344016]

Archive:  text8.zip
  inflating: text8                   
BUILDING VOCABULARY
Processed 17005207 tokens.
Counted 253854 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 71290.

COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 71290 words.
Building lookup table...table contains 94990279 elements.
Processing token: 15100000./demo.sh: line 55:  2710 Killed                  $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE

I increased the VM's main memory to 4 GB and retried:

harun@harun-mint ~/Desktop/GloVe-master $ ./demo.sh 
mkdir -p build
gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
BUILDING VOCABULARY
Processed 17005207 tokens.
Counted 253854 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 71290.

COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 71290 words.
Building lookup table...table contains 94990279 elements.
Processed 17005206 tokens.
Writing cooccurrences to disk.........2 files in total.
Merging cooccurrence files: processed 60666466 lines.

SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 0 lines../demo.sh: line 55:  2294 Segmentation fault      $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
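One workaround is to lower the MEMORY setting near the top of demo.sh, which the script passes to cooccur and shuffle as -memory (visible in the command lines above), so the in-memory buffers stay within the guest's RAM. The sketch below edits a stand-in copy of that line; 2.0 GB is an assumed value for a ~4 GB guest, not a recommendation from the maintainers:

```shell
# Stand-in for the MEMORY line at the top of demo.sh (the real script
# is not modified here); lowering it shrinks the shuffle buffer.
printf 'MEMORY=4.0\n' > demo-snippet.sh
sed -i 's/^MEMORY=.*/MEMORY=2.0/' demo-snippet.sh
cat demo-snippet.sh   # MEMORY=2.0
```

A smaller MEMORY value makes shuffle work in more, smaller chunks, trading speed for a lower peak allocation.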
asnatm commented 8 years ago

Hi, did you manage to solve the problem? I have the same segmentation fault issue. Thanks!

ghost commented 8 years ago

@asnatm I'm currently working on a separate segfault issue in build/glove in https://github.com/stanfordnlp/GloVe/issues/14. Can you clarify the circumstances under which you're getting the segfault (i.e., in glove.c or shuffle.c, and using which corpus)?

asnatm commented 8 years ago

I'm getting the segfault when running the demo script, at the shuffling-by-chunks step (shuffle.c). I'm using my own corpus, a small one with only 23405 words in the vocab. Thanks!

ghost commented 8 years ago

I understand that you may not be able to share the corpus. But it's pretty difficult for me to help if I can't reproduce locally. Could you try cutting the corpus size in half repeatedly until you don't get the problem any more, and then sending me the smallest such corpus, perhaps with words substituted out for numbers representing their indices in the vocab?
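The halving experiment suggested above can be scripted. This sketch only demonstrates the mechanics on a toy file (corpus.txt is a stand-in; in practice you would rerun the failing shuffle step on each half and keep the smallest input that still crashes):

```shell
# Toy demonstration: repeatedly halve a file by line count.
seq 1 16 > corpus.txt                 # stand-in corpus of 16 lines
while [ "$(wc -l < corpus.txt)" -gt 2 ]; do
    head -n "$(( $(wc -l < corpus.txt) / 2 ))" corpus.txt > half.txt
    mv half.txt corpus.txt
    # in practice: stop halving once the failing step no longer crashes
done
wc -l < corpus.txt                    # 2
```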

asnatm commented 8 years ago

Hi,

This is a link to the corpus: https://www.dropbox.com/s/vqa6ogtxipy2lvz/myCorpus.txt?dl=0 (too large to attach). Is the format OK? I'll do the exercise you suggested and send you the smallest corpus with which I still get the problem.

Highly appreciated,

Thanks, Asi


asnatm commented 8 years ago

Hi,

I solved it. I didn't have enough memory.... sorry

Thanks, Asi


dutkaD commented 6 years ago

Might sound stupid, but for me the problem was solved just by restarting the computer :snail:

yousefihsm commented 5 years ago

Hi all. I ran into the same problem. The problem is memory: you can adjust the processing memory according to your computer's memory.

ghost commented 5 years ago

bummer
