stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Bug in cooccurrence calculation? #65

Open shajra opened 7 years ago

shajra commented 7 years ago

Hi,

We wanted to see if GloVe's code could be reused for an idea we had at work, so I started looking at cooccur.c. C code can be pretty low-level, so I tried sanity-checking my intuition about what is (and should be) going on with some fprintf calls and some small examples.

I think the first w1/w2 pair is being double counted. The problem seems to be within the merge_files call, specifically in merge_write. I'm still thinking through the "priority queue" part of the algorithm, but when I removed one line (https://github.com/stanfordnlp/GloVe/blob/master/src/cooccur.c#L217) I got the counts I was expecting. The example I was using was tiny (vocabulary of size 2) and symmetric ("a a b b"), so it didn't make sense to get asymmetric results.

Does it make sense that removing this line might fix a defect? My gut feeling is that people just haven't noticed because it seems to affect only one pair's counts, and typical corpora have tons of word pairs.

I'm happy to provide more information if it helps.

ghost commented 7 years ago

Try submitting a pull request, and we'll see how this impacts performance with the Travis integration tests.

onatandroid commented 3 years ago

I have the same problem.