stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Glove segmentation fault #181

Closed PraveshKoirala closed 4 years ago

PraveshKoirala commented 4 years ago

I am encountering a segmentation fault when trying to build the model. All preprocessing steps (vocab generation, cooccurrence counting, and shuffling) run without errors. The final build step, though, crashes. I am training the model on UTF-8 encoded text, so I am wondering if this is somehow related.

Here's how I am running the command: -save-file models/gloves/processed_stemmed.glove -threads 8 -input-file /tmp/tmp4yuub3xg/cshuffleooccur.txt -iter 5 -vector-size 300 -binary 0 -vocab-file /tmp/tmp4yuub3xg/vocab.txt -verbose 2

And here is what the core dump looks like in gdb:

TRAINING MODEL
Read 48 lines.
Initializing parameters...Using random seed 1597553266
done.
vector size: 300
vocab size: 541869
x_max: 100.000000
alpha: 0.750000
[New Thread 0x7ffec02a6700 (LWP 4689)]
[New Thread 0x7ffebfaa5700 (LWP 4690)]
[New Thread 0x7ffebf2a4700 (LWP 4691)]

Thread 2 "glove" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffec02a6700 (LWP 4689)]
0x0000555555555dea in glove_thread ()
(gdb) where
#0  0x0000555555555dea in glove_thread ()
#1  0x00007ffff781f6db in start_thread (arg=0x7ffec02a6700) at pthread_create.c:463
#2  0x00007ffff7548a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

AngledLuffa commented 4 years ago

Are you using the current git head?

If so, is it possible to share the data? I'll delete it afterwards if you want.

PraveshKoirala commented 4 years ago

@AngledLuffa, yes, I am using the current head. Here is a sample of the data (this too causes a segmentation fault, so I think it really has something to do with the data itself): https://gist.github.com/PraveshKoirala/db0084637c403d0c3fef29cf48026185

The language is Nepali and the script is Devanagari, which is used by many South Asian languages, including Hindi. The Unicode block used by this script is described at https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)

PraveshKoirala commented 4 years ago

I think I have discovered my mistake. It has nothing to do with UTF-8 encoding; I was just using the command wrong. In the shuffling step, running shuffle < input > output prints the usage instructions instead of producing the shuffled binary file. It seems that the shuffle command expects at least one argument, so running shuffle -verbose 2 < input > output works just fine.
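For anyone who hits the same crash, a quick sanity check on the shuffled file before training may help. The sketch below is not part of GloVe. It assumes the binary record layout matches the CREC struct in the GloVe sources (two 1-based int word indices followed by a double value), and the function name check_cooccurrence_file is made up for illustration. A file that actually holds usage text will almost certainly fail the index range check, which is presumably also why training crashed.

```c
#include <stdio.h>

/* Assumed record layout, mirroring the CREC struct in the GloVe sources:
   two 1-based word indices followed by a double co-occurrence value. */
typedef struct {
    int word1;
    int word2;
    double val;
} CREC;

/* Hypothetical helper: returns the number of records if the file looks like
   a valid binary co-occurrence file for the given vocabulary size, or -1 if
   the file cannot be opened or any record looks wrong. */
long check_cooccurrence_file(const char *path, long vocab_size) {
    FILE *fin = fopen(path, "rb");
    CREC rec;
    long size, count = 0;

    if (fin == NULL) return -1;

    /* The file size should be a whole number of records.
       ftell returns a long, so this assumes a platform where long can
       hold the file size (e.g. 64-bit Linux). */
    if (fseek(fin, 0, SEEK_END) != 0 || (size = ftell(fin)) < 0 ||
        size % (long)sizeof(CREC) != 0) {
        fclose(fin);
        return -1;
    }
    rewind(fin);

    while (fread(&rec, sizeof(CREC), 1, fin) == 1) {
        /* Indices must fall inside the vocabulary; the value must be a
           positive number (the negated comparison also rejects NaN). */
        if (rec.word1 < 1 || rec.word1 > vocab_size ||
            rec.word2 < 1 || rec.word2 > vocab_size ||
            !(rec.val > 0.0)) {
            fclose(fin);
            return -1;
        }
        count++;
    }
    fclose(fin);
    return count;
}
```

Calling check_cooccurrence_file on the shuffled file with the vocab size reported by the training step (541869 in the log above) and getting -1 would point at the input file rather than at the text encoding.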

I am so sorry to have troubled you @AngledLuffa

AngledLuffa commented 4 years ago

It's no trouble - doubly so since it's Sunday and I hadn't actually done any work on it yet. Glad it worked out.

It does seem that if

shuffle -verbose 2

works and

shuffle

doesn't work, there's a bit of an issue. So please leave this issue open.

AngledLuffa commented 4 years ago

I made it so that running shuffle with no arguments works as expected, and running with -help prints out the help message. Thanks!
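The behavior described here can be sketched roughly as follows. This is not the actual shuffle.c code: the Options struct, the parse_options function, and the default verbosity of 2 are assumptions made for illustration. The point is only that an empty argument list falls through to the defaults, while help is shown solely when -help is passed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical option holder; the real tool has more settings. */
typedef struct {
    int verbose;    /* verbosity level */
    int show_help;  /* 1 if the caller asked for the help message */
} Options;

/* Parse the command line. With no arguments the defaults are returned,
   so the program proceeds to read stdin and write stdout. */
Options parse_options(int argc, char **argv) {
    Options opt = {2, 0};  /* assumed default verbosity, help off */
    int i;

    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-help") == 0) {
            opt.show_help = 1;
        } else if (strcmp(argv[i], "-verbose") == 0 && i + 1 < argc) {
            opt.verbose = atoi(argv[++i]);  /* consume the flag's value */
        }
    }
    return opt;
}
```

A caller would print the help text and exit when show_help is set. Writing that text to stderr rather than stdout keeps it out of a file that stdout has been redirected to.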