Hi,

We wanted to see whether GloVe's code could be reused for an idea we had at work, so I started looking at cooccur.c. C code can be fairly low-level, so I sanity-checked my intuition about what is (and should be) going on with some fprintf calls and a few small examples.
I think the first w1/w2 pair is being double counted. The problem seems to be in the merge_files call, specifically in merge_write. I'm still thinking through the "priority queue" part of the algorithm, but when I removed one line (https://github.com/stanfordnlp/GloVe/blob/master/src/cooccur.c#L217) I got the counts I was expecting. The example I was using was tiny (a vocabulary of size 2) and symmetric ("a a b b"), so it didn't make sense to get asymmetric results.
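To make the hypothesis concrete, here is a minimal, self-contained sketch of the accumulate-on-merge pattern I believe merge_write implements. The struct and function names are simplified stand-ins, not the actual GloVe code: the point is only that if the record used to seed `old` is also fed back through the merge loop, its count gets added to itself, double counting exactly one pair.

```c
/* Sketch of an accumulate-on-merge step (simplified, not GloVe's source):
 * "old" holds the record currently being accumulated; each incoming record
 * either folds into it (same pair) or flushes it to the output file. */
#include <stdio.h>

typedef struct {
    int word1, word2;
    double val;
} rec;

static void merge_step(rec incoming, rec *old, FILE *out) {
    if (incoming.word1 == old->word1 && incoming.word2 == old->word2) {
        old->val += incoming.val;          /* same pair: accumulate */
    } else {
        fwrite(old, sizeof(rec), 1, out);  /* new pair: flush the old one */
        *old = incoming;
    }
}

int main(void) {
    /* Toy sorted stream of cooccurrence records. */
    rec stream[] = { {1, 2, 1.0}, {1, 2, 0.5}, {2, 1, 1.0} };
    int n = sizeof(stream) / sizeof(stream[0]);
    FILE *out = fopen("merged.bin", "wb");
    if (!out) return 1;

    /* Seed "old" with the first record... */
    rec old = stream[0];

    /* ...and, to reproduce the suspected bug, feed that same first record
     * through the loop as well: pair (1,2) ends up with 2.5 instead of the
     * expected 1.5. Starting the loop at i = 1 gives the correct totals. */
    for (int i = 0; i < n; i++) merge_step(stream[i], &old, out);

    fwrite(&old, sizeof(rec), 1, out);     /* flush the final record */
    fclose(out);
    return 0;
}
```

If that reading of the merge is right, it would explain why only the very first pair in the merged output comes out too large.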
Does it make sense that removing this line might fix a defect? My gut feeling is that people just haven't noticed because it seems to affect only one line's counts, and typical corpora contain a huge number of word pairs.
I'm happy to provide more information if it helps.