Closed PraveshKoirala closed 4 years ago
Are you using the current git head?
If so, is it possible to share the data? I'll delete it afterwards if you want.
@AngledLuffa, yes, I am using the current head. Here is a sample of the data (this, too, causes a segmentation fault, so I think it really has something to do with the data itself): https://gist.github.com/PraveshKoirala/db0084637c403d0c3fef29cf48026185
The language is Nepali, and the script is Devanagari, which is used by many South Asian languages, including Hindi. The Unicode block used by this script is https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)
I think I have discovered my mistake. It has nothing to do with UTF-8 encoding. I was simply using the command incorrectly.
In the shuffling step, running shuffle < input > output
produces the usage instructions rather than a correctly shuffled binary file. It seems that the shuffle command expects at least one argument, so running shuffle -verbose 2 < input > output
works just fine.
I am so sorry to have troubled you @AngledLuffa
It's no trouble - doubly so since it's Sunday and I hadn't actually done any work on it yet. Glad it worked out.
It does seem that if
shuffle -verbose 2
works and
shuffle
doesn't work, there's a bit of an issue. So please leave this issue open.
I made it so that running shuffle with no arguments works as expected, and running with -help prints out the help message. Thanks!
I am encountering a segmentation fault when trying to build the model. All preprocessing steps execute flawlessly, i.e. vocab generation, cooccurrence counting, and shuffling. The final model-building step, though, crashes. I am training the model on UTF-8 encoded text, so I am wondering if this is somehow related.
Here's how I am running the command:
-save-file models/gloves/processed_stemmed.glove -threads 8 -input-file /tmp/tmp4yuub3xg/cshuffleooccur.txt -iter 5 -vector-size 300 -binary 0 -vocab-file /tmp/tmp4yuub3xg/vocab.txt -verbose 2
And here is what the core dump looks like in gdb:
TRAINING MODEL
Read 48 lines.
Initializing parameters...Using random seed 1597553266 done.
vector size: 300
vocab size: 541869
x_max: 100.000000
alpha: 0.750000
[New Thread 0x7ffec02a6700 (LWP 4689)]
[New Thread 0x7ffebfaa5700 (LWP 4690)]
[New Thread 0x7ffebf2a4700 (LWP 4691)]

Thread 2 "glove" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffec02a6700 (LWP 4689)]
0x0000555555555dea in glove_thread ()
(gdb) where
#0  0x0000555555555dea in glove_thread ()
#1  0x00007ffff781f6db in start_thread (arg=0x7ffec02a6700)
#2  0x00007ffff7548a3f in clone ()