stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Allow glove.c to output context vectors #219

Closed by atkindel 11 months ago

atkindel commented 1 year ago

It would be helpful to be able to use the word vectors and context vectors separately in some applications. In particular, per Levy, Goldberg & Dagan ("Improving distributional similarity with lessons learned from word embeddings", 2015 TACL; https://doi.org/10.1162/tacl_a_00134), the angle between a word vector and a context vector has a different meaning than the angle between two word vectors, and users may wish to distinguish between the two depending on what they are trying to do.
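To illustrate why the two kinds of vectors are worth keeping apart, here is a minimal sketch in C. It is not code from glove.c; the function names `dot` and `summed_dot` and the argument names are hypothetical. It only relies on the algebraic identity (w_a + c_a) . (w_b + c_b) = w_a . w_b + c_a . c_b + w_a . c_b + c_a . w_b, which shows that once word and context vectors are added together, the word-word, context-context and word-context terms are mixed into a single number.

```c
#include <assert.h>
#include <stddef.h>

/* Plain dot product of two vectors of length dim. */
double dot(const double *a, const double *b, size_t dim) {
    assert(dim > 0);
    double s = 0.0;
    for (size_t i = 0; i < dim; i++) s += a[i] * b[i];
    return s;
}

/* Dot product of the summed representations (w_a + c_a) . (w_b + c_b),
 * written out as its separate terms. All names are illustrative only. */
double summed_dot(const double *w_a, const double *c_a,
                  const double *w_b, const double *c_b, size_t dim) {
    double word_word = dot(w_a, w_b, dim);                     /* word vs. word */
    double ctx_ctx   = dot(c_a, c_b, dim);                     /* context vs. context */
    double word_ctx  = dot(w_a, c_b, dim) + dot(c_a, w_b, dim); /* word vs. context, both directions */
    return word_word + ctx_ctx + word_ctx;
}
```

With only the summed vectors on disk, the individual terms cannot be recovered; with separate word and context vectors, each term can be computed on its own.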

This PR modifies the .txt output option in glove.c to accept model=3; with this setting, the context vectors are row-concatenated to the word vectors rather than added to them. Would it be preferable to column-concatenate them instead, or to save the context vectors to a separate file? If there is sufficient interest, I'm happy to rewrite it.
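For readers comparing the options, the following is a rough sketch of the three text layouts under discussion: summing, row-concatenating and column-concatenating. It is not the code in glove.c. The `enum layout` values, the function `write_vectors`, and the assumption that word and context vectors live in two separate row-major arrays are all made up for illustration, and the enum values do not correspond to the actual `model` settings.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical layout flags; NOT the values glove.c uses for its model option. */
enum layout { LAYOUT_SUM, LAYOUT_ROW_CONCAT, LAYOUT_COL_CONCAT };

/* Append n space-separated values to the current output line. */
static void write_values(FILE *out, const double *v, size_t n) {
    for (size_t i = 0; i < n; i++) fprintf(out, " %g", v[i]);
}

/* word and ctx are assumed to be row-major [vocab_size][dim] matrices.
 * Returns the number of text lines written. */
size_t write_vectors(FILE *out, const char **vocab,
                     const double *word, const double *ctx,
                     size_t vocab_size, size_t dim, enum layout mode) {
    assert(out != NULL && dim > 0);
    size_t lines = 0;
    for (size_t i = 0; i < vocab_size; i++) {
        const double *w = word + i * dim;
        const double *c = ctx + i * dim;
        fputs(vocab[i], out);
        if (mode == LAYOUT_SUM) {
            /* One line per token holding w + c. */
            for (size_t j = 0; j < dim; j++) fprintf(out, " %g", w[j] + c[j]);
        } else {
            write_values(out, w, dim);
            /* Column-concatenation: context values follow on the same line. */
            if (mode == LAYOUT_COL_CONCAT) write_values(out, c, dim);
        }
        fputc('\n', out);
        lines++;
    }
    if (mode == LAYOUT_ROW_CONCAT) {
        /* Row-concatenation: context vectors follow as a second block of lines. */
        for (size_t i = 0; i < vocab_size; i++) {
            fputs(vocab[i], out);
            write_values(out, ctx + i * dim, dim);
            fputc('\n', out);
            lines++;
        }
    }
    return lines;
}
```

In this sketch, row-concatenation doubles the number of lines and repeats each token, so a reader has to know where the second block starts. Column-concatenation keeps one line per token but doubles the line width.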

AngledLuffa commented 11 months ago

Sorry for the slowness. I'm not nearly as familiar with glove as with the other software I'm working on, so I try to avoid it, but there aren't many of us still here working on it. If you still want this merged, I'm happy to merge it

AngledLuffa commented 11 months ago

Alright, I merged it, but with one caveat (see the commit below). Actually, I can thank the compiler for finding that, and I present my not finding it myself when reviewing as evidence of my C rustiness and of my general motivation for not touching this code more than necessary.

https://github.com/stanfordnlp/GloVe/commit/a577eeeb8074f2c362fa7738143214eca9cb414f

atkindel commented 10 months ago

Ah, thank you!! I wasn't sure how active the maintenance on this was so I figured I'd just come back and reopen it if there was need/interest, but you beat me to it :) and thanks very much for catching the bug, I'm also very rusty with C...