stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.86k stars 1.51k forks

Vocab_count accounts for empty strings #120

Open tarekeldeeb opened 6 years ago

tarekeldeeb commented 6 years ago

This particular line may hash empty strings:

    while (fscanf(fid, format, str) != EOF) { // Insert all tokens into hashtable

I found the following as a result in my vocab.txt:

    Arabic_Word 903
    singleSpace 903
    Arabic_Word 902

This empty string was the root cause of many further problems with cooccur and GloVe. The vectors for all entries beneath the empty line did not start with the string from vocab.txt but with the number of occurrences instead! This simply ruins vectors.txt.
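For illustration, here is a minimal sketch of one way to guard the token loop. It is not the change from any particular commit, and it rests on assumptions: that the offending token is either zero-length (for example, a NUL byte in the corpus) or consists only of bytes that render as blank, such as a UTF-8 non-breaking space. The names is_blank_token, count_usable_tokens, and MAX_TOKEN_LEN are hypothetical and do not come from vocab_count.c.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOKEN_LEN 1000 /* assumed buffer limit for this sketch */

/* Hypothetical helper: returns 1 if the token has no visible content,
 * i.e. it is empty or made up only of ASCII whitespace and/or UTF-8
 * non-breaking spaces (bytes 0xC2 0xA0). The NBSP case is an assumption
 * about what a "single space" token might be. */
static int is_blank_token(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    while (*p != '\0') {
        if (isspace(*p)) { p += 1; continue; }
        if (p[0] == 0xC2 && p[1] == 0xA0) { p += 2; continue; }
        return 0; /* found a visible byte */
    }
    return 1; /* empty, or only blank bytes */
}

/* Reads whitespace-separated tokens and counts those that would be kept.
 * Blank tokens are skipped and tallied in *skipped instead. */
static long count_usable_tokens(FILE *fid, long *skipped) {
    char format[20];
    char str[MAX_TOKEN_LEN + 1];
    long kept = 0;

    *skipped = 0;
    snprintf(format, sizeof format, "%%%ds", MAX_TOKEN_LEN);
    while (fscanf(fid, format, str) == 1) {
        if (is_blank_token(str)) {
            (*skipped)++;
            continue; /* never let a blank token reach the vocabulary */
        }
        kept++; /* a real tool would insert str into its hash table here */
    }
    return kept;
}
```

The point of the check is that a blank word never reaches vocab.txt, so the later steps that read a word followed by a count on each line cannot mistake the count for the word.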

devincornell commented 5 years ago

@tarekeldeeb Thank you! I copied that commit change into the original GloVe on my machine and it solved this issue for me :)