stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Predicted vectors not normalized? #76

Open Archaneos opened 7 years ago

Archaneos commented 7 years ago

I'm not an expert in linear algebra, but shouldn't the predicted vectors pred_vec in evaluate.py be normalized so that the cosine similarity falls between -1 and 1? I was surprised to get values greater than 1 in some cases.
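For illustration, a minimal numpy sketch (hypothetical 2-d vectors, not the actual data from evaluate.py) of why the raw dot product can exceed 1 when pred_vec is not unit-length, while the properly normalized cosine similarity stays in [-1, 1]:

import numpy as np

# Hypothetical example: a unit-norm word vector and an unnormalized predicted vector.
w_row = np.array([0.6, 0.8])      # ||w_row|| = 1
pred_vec = np.array([1.2, 1.6])   # same direction, but ||pred_vec|| = 2

raw_dot = np.dot(w_row, pred_vec)                                      # 2.0, outside [-1, 1]
cosine = raw_dot / (np.linalg.norm(w_row) * np.linalg.norm(pred_vec))  # 1.0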

JungeAlexander commented 7 years ago

Just came across this, too. Each word vector (row of W) is normalized to unit norm in the following piece of code from evaluate.py:

# normalize each word vector to unit variance
W_norm = np.zeros(W.shape)
d = (np.sum(W ** 2, 1) ** (0.5))
W_norm = (W.T / d).T

The in-line comment is a bit misleading since the normalization concerns each row's Euclidean norm, not its variance. Just submitted a tiny PR changing this: https://github.com/stanfordnlp/GloVe/pull/86
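As a quick sanity check of that distinction (a small sketch with a made-up matrix, not code from evaluate.py), the rows of W_norm come out with unit Euclidean norm but, in general, not unit variance:

import numpy as np

W = np.random.rand(5, 50)              # hypothetical word-vector matrix
d = (np.sum(W ** 2, 1) ** (0.5))       # per-row Euclidean norms
W_norm = (W.T / d).T                   # same normalization as in evaluate.py

print(np.linalg.norm(W_norm, axis=1))  # all 1.0: unit norm
print(np.var(W_norm, axis=1))          # generally not 1.0: not unit variance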

So the dot product in the following should be equivalent to cosine similarity:

#cosine similarity if input W has been normalized
dist = np.dot(W, pred_vec.T)
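If pred_vec is not guaranteed to be unit-length, one way to keep the scores bounded would be to divide it by its norm before the dot product; a sketch of that idea (not the repository's actual fix):

pred_norm = pred_vec / np.linalg.norm(pred_vec)  # assumes pred_vec is nonzero
dist = np.dot(W, pred_norm.T)                    # true cosine similarity, within [-1, 1]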

Have you come across more cases where the similarity falls outside [-1, 1] since your last comment?