The code (cooccur.c and glove.c) contains some built-in checks for repeated entries in the vocabulary.
Overly long entries are detected as well.
However, the way both cases are handled leads to very strange and questionable results.
As an example I provide a toy vocabulary 'voc' and a toy text 'txt'
voc.zip txt.zip
The commands I ran were:
./build/cooccur -window-size 3 -vocab-file voc < txt > bin
./build/shuffle -verbose 1 < bin > shuf
./build/glove -vocab-file voc -input-file shuf
The resulting vectors are wrong, imho:
There are 3 different vectors for 'mies', and there are 2 vectors for parts of the overly long 'klaasklaas....' entry.
The problems are in cooccur.c, where the vocabulary is read and the words are hashed:
When hashinsert fails, the id is still incremented (via ++j).
When reading fails due to an overly long entry, 1000 letters are hashed and we try again with the remainder, splitting the long word into several words.
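(Using while (fscanf(fid, format, str, &id) != EOF) is an anti-pattern anyway: fscanf returns EOF only when input fails before the first conversion. On a partial match it returns the number of items assigned, here 1 instead of 2, and the loop carries on.)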
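To make the return-value point concrete, here is a small sketch of the same loop shape. It is not the code from cooccur.c: the helper scan_like_cooccur and the ScanStats struct are hypothetical, and the word limit is 8 characters instead of 1000 to keep the input short.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only. */
typedef struct {
    int iterations;  /* how often the loop body ran */
    int partial;     /* how often fscanf converted fewer than 2 items */
} ScanStats;

/* Same loop condition as quoted above: stop only when fscanf returns EOF. */
static ScanStats scan_like_cooccur(FILE *f) {
    char str[9];        /* 8 characters plus the terminating NUL */
    long long id = 0;
    ScanStats s = {0, 0};
    int r;
    assert(f != NULL);
    while ((r = fscanf(f, "%8s %lld", str, &id)) != EOF) {
        s.iterations++;
        if (r != 2)
            s.partial++;  /* word was cut off, the count did not match */
    }
    return s;
}
```

For the three-line input "jan 5", "klaasklaasklaas 2", "piet 1" the loop body runs 4 times, and one of those iterations is a partial match: the 15-letter word is cut after 8 letters and the remaining 7 letters are read as a separate entry.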
SO: the vocabulary reading needs improvement, e.g. by skipping repeated and overly long entries entirely.
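A rough sketch of what such a reading loop could look like, under a few assumptions: the names Vocab, read_vocab and vocab_free are made up for this example, a plain linear list stands in for the real hash table, and "skipping" a repeated entry is taken to mean keeping its first occurrence and dropping the later ones.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD 1000   /* assumed limit, matching the 1000 letters mentioned above */
#define MAX_VOCAB 1024  /* toy capacity, for this sketch only */

/* Toy vocabulary: a linear list stands in for the real hash table. */
typedef struct {
    char *words[MAX_VOCAB];
    long long size;
} Vocab;

static int vocab_contains(const Vocab *v, const char *w) {
    for (long long i = 0; i < v->size; i++)
        if (strcmp(v->words[i], w) == 0) return 1;
    return 0;
}

/* Read "word count" lines from f. Returns the number of skipped entries;
   v->size ends up as the number of entries actually kept. */
static long long read_vocab(FILE *f, Vocab *v) {
    char line[MAX_WORD + 64];
    long long count, skipped = 0;
    assert(f != NULL && v != NULL);
    v->size = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        size_t len = strlen(line);
        if (len > 0 && line[len - 1] != '\n' && !feof(f)) {
            /* The line did not fit in the buffer: drop the rest of it. */
            int c;
            while ((c = fgetc(f)) != EOF && c != '\n') {}
            skipped++;
            continue;
        }
        size_t wlen = strcspn(line, " \t\r\n");
        if (wlen == 0 || wlen > MAX_WORD ||
            sscanf(line + wlen, "%lld", &count) != 1) {
            skipped++;  /* empty word, overly long word, or no count */
            continue;
        }
        line[wlen] = '\0';
        if (vocab_contains(v, line) || v->size == MAX_VOCAB) {
            skipped++;  /* repeated entry (or toy capacity reached) */
            continue;
        }
        char *copy = malloc(wlen + 1);
        if (copy == NULL) { skipped++; continue; }
        memcpy(copy, line, wlen + 1);
        v->words[v->size++] = copy;  /* the id advances only on success */
    }
    return skipped;
}

static void vocab_free(Vocab *v) {
    for (long long i = 0; i < v->size; i++) free(v->words[i]);
    v->size = 0;
}
```

Reading whole lines and checking the word length by hand avoids the case where a cut-off word happens to be followed by digits and still looks like a valid entry.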
BUT THEN:
This leads to a vocabulary that is smaller than the number of lines in the file. In glove.c the vocabulary size is first counted as the size of the file, which would then be wrong. AND the vectors are written using words that come from the same faulty reading loop as in cooccur.c, so fixing only one of the two might let things get out of sync.
So this needs rework: glove.c should use exactly the same reading logic as cooccur.c.
Storing the true vocabulary size in the .bin file might be a good idea?
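One way this could look, purely as a sketch: a small header written in front of the records. Nothing like this exists in the current file format as far as I can tell; the CooccurHeader struct, the "GLVC" tag and both functions are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define COOCCUR_MAGIC "GLVC"  /* hypothetical 4-byte tag, not part of the real format */

/* Hypothetical header placed in front of the co-occurrence records. */
typedef struct {
    char magic[4];
    long long vocab_size;  /* number of entries actually kept while reading */
} CooccurHeader;

/* Returns 0 on success, -1 on a write error. */
static int write_header(FILE *out, long long vocab_size) {
    CooccurHeader h;
    assert(out != NULL);
    memset(&h, 0, sizeof h);  /* also zeroes any padding bytes */
    memcpy(h.magic, COOCCUR_MAGIC, 4);
    h.vocab_size = vocab_size;
    return fwrite(&h, sizeof h, 1, out) == 1 ? 0 : -1;
}

/* Returns 0 and fills *vocab_size on success, -1 if the header is missing or invalid. */
static int read_header(FILE *in, long long *vocab_size) {
    CooccurHeader h;
    assert(in != NULL && vocab_size != NULL);
    if (fread(&h, sizeof h, 1, in) != 1) return -1;
    if (memcmp(h.magic, COOCCUR_MAGIC, 4) != 0) return -1;
    if (h.vocab_size < 0) return -1;
    *vocab_size = h.vocab_size;
    return 0;
}
```

Since shuffle sits between cooccur and glove, it would presumably have to copy such a header through unchanged. Writing the struct raw also ties the file to one machine's byte order and struct layout.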