stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Segmentation fault (core dumped) in Glove.c #123

Open · aerinkim opened this issue 6 years ago

aerinkim commented 6 years ago

I'm trying to train GloVe on a pretty big dataset, the newest Wikipedia dump (a 22 GB text file). The vocabulary I'm training on has 1.7 million words. Every step before glove (vocab_count, cooccur, shuffle) runs smoothly without any memory error. (My RAM = 64 GB)

However, when I run glove, I get "Segmentation fault (core dumped)".

aerin@capa:~/Desktop/GloVe/build$ ./glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt -save-file glove300 -iter 25 -gradsq-file gradsq -verbose 2 -vector-size 300 -threads 1 -alpha 0.75 -x-max 100.0 -eta 0.05 -binary 2 -model 2
TRAINING MODEL
Read 1939406304 lines.
Initializing parameters...done.
vector size: 300
vocab size: 1737888
x_max: 100.000000
alpha: 0.750000
Segmentation fault (core dumped)

I tried different numbers of threads as well: 1, 2, 4, 8, 16, 32, etc. Nothing runs. Can someone please point me to where I should look? Thanks for this repository!

Update

I cut the vocabulary from 1.7 million to 1 million words and glove.c now runs without the "segmentation fault" error, so it is a memory error. But I would love to learn how to resolve it and be able to train a model on the larger dataset! Any comment will be highly valued. Thanks.
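
For a sense of scale, glove.c keeps two dense arrays of doubles, W (the parameters) and gradsq (the AdaGrad accumulators), each with 2 * vocab_size * (vector_size + 1) entries: word vectors, context vectors, and one bias per row. A rough back-of-the-envelope sketch, assuming that layout and the numbers from the log above:

    #include <stdio.h>

    int main(void) {
        /* Rough estimate of glove.c's parameter memory, assuming two double
         * arrays (W and gradsq) of 2 * vocab_size * (vector_size + 1) entries
         * each; the +1 is the per-vector bias term. */
        long long vocab_size  = 1737888;   /* vocab size reported in the log */
        long long vector_size = 300;
        long long entries = 2LL * vocab_size * (vector_size + 1);
        double gib = 2.0 * entries * sizeof(double) / (1024.0 * 1024.0 * 1024.0);
        printf("%lld entries per array, ~%.1f GiB total for W + gradsq\n", entries, gib);
        return 0;
    }

That works out to roughly 16 GiB, which fits comfortably in 64 GB of RAM, so the parameter arrays alone don't obviously explain an out-of-memory crash; a failed allocation that goes unchecked, or an index running past the end of W, would however show up exactly like this, as a segmentation fault rather than a clean error message.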

pjox commented 6 years ago

I am not sure it is a memory error; I am getting the exact same error on a cluster with 3 TB of RAM at my disposal. However, I do agree that any comment on how to solve this would be highly appreciated!

WiraDKP commented 5 years ago

I hope my comment helps. In my case, a small dataset (1.6 MB) has this problem, but a larger dataset (600 MB) doesn't. So I think memory is not the problem; it is somehow related to the content of the data, which would explain why cutting your data works. @byorxyz

BetterZhouXu commented 5 years ago

Just download the latest version from GitHub and the problem will be solved. Do not download GloVe from the Stanford GloVe homepage.

SeyedMST commented 5 years ago

I have the same problem running glove.c on my 150 GB corpus (I downloaded GloVe from GitHub). I have 60 GB of memory and the program never allocates more than 3 GB.

dtjsamrat625 commented 5 years ago

"Just download the latest version from GitHub and the problem will be solved. Do not download GloVe from the Stanford GloVe homepage."

It still persists for me.

honnibal commented 4 years ago

I spent a while debugging a segfault that I thought was this problem, but I'd actually passed a directory as the -vocab-file argument (it was late...). So for future readers, double-check your arguments to make sure it's not a simple problem like that. I'll make a PR to catch the error case I hit.
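
A minimal guard of the kind described here, purely illustrative and not necessarily what that PR does, could verify that the -vocab-file path is a readable regular file before training starts:

    #include <stdio.h>
    #include <sys/stat.h>

    /* Illustrative check only (not taken from glove.c): refuse to start
     * training unless the -vocab-file argument is a readable regular file. */
    static int check_vocab_file(const char *path) {
        struct stat st;
        if (stat(path, &st) != 0) {
            fprintf(stderr, "Error: cannot stat vocab file \"%s\"\n", path);
            return 0;
        }
        if (!S_ISREG(st.st_mode)) {
            fprintf(stderr, "Error: vocab file \"%s\" is not a regular file\n", path);
            return 0;
        }
        FILE *fp = fopen(path, "r");
        if (fp == NULL) {
            fprintf(stderr, "Error: cannot open vocab file \"%s\" for reading\n", path);
            return 0;
        }
        fclose(fp);
        return 1;
    }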

FWIW, I spent some time in glove.c trying to find where something might go wrong if the cooccurrence file was too large. My only guess was that there might be an integer overflow? I doubt it, though.

If you do hit a segfault, debugging should be quite easy. Set -threads 1 and start adding print statements to the glove.c file. I would suggest trying to print the index into the W array on line 127, like this:

    for (b = 0; b < vector_size; b++) {
        fprintf(stderr, "Accessing W. b+l1=%lld, b+l2=%lld\n", b + l1, b + l2);
        diff += W[b + l1] * W[b + l2]; // dot product of word and context word vector
    }

It's a beautifully short program, so I'm sure that once the failing line is identified it can be fixed easily.
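
To turn those print statements into a check that fails loudly instead of crashing, a small helper along these lines could be dropped into glove.c ahead of that loop. This is only a sketch: it assumes, as in the current glove.c, that W holds 2 * vocab_size * (vector_size + 1) doubles and that l1 and l2 are the offsets of the word and context-word rows; the helper name is made up here.

    #include <stdio.h>

    /* Hypothetical helper, not part of glove.c: returns 1 if the rows starting
     * at l1 and l2 (vector_size weights plus one bias each) fit inside W,
     * assuming W has 2 * vocab_size * (vector_size + 1) entries. */
    static int rows_in_bounds(long long l1, long long l2,
                              long long vocab_size, long long vector_size) {
        long long w_size = 2 * vocab_size * (vector_size + 1);
        if (l1 < 0 || l2 < 0) return 0;
        if (l1 + vector_size >= w_size) return 0;
        if (l2 + vector_size >= w_size) return 0;
        return 1;
    }

Calling it right before the loop, and skipping the record with an fprintf to stderr when it returns 0, would quickly show whether a corrupt or mismatched cooccurrence record is pushing the index past the end of W, as opposed to something going wrong inside the loop itself.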

josephwccheng commented 1 year ago

I also had this problem, but I found an approach that solved it.

Analysing my problem:

Solution:

Result: I successfully built my own GloVe vectors.

Good luck!