Archaneos opened this issue 7 years ago
Just came across this, too. Each word vector in the word vector matrix is normalized to unit norm in the following piece of code:
```python
# normalize each word vector to unit variance
W_norm = np.zeros(W.shape)
d = (np.sum(W ** 2, 1) ** (0.5))
W_norm = (W.T / d).T
```
The in-line comment is a bit misleading, since the normalization here concerns the rows' L2 norm, not their variance. Just submitted a tiny PR changing this: https://github.com/stanfordnlp/GloVe/pull/86
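For reference, here is a small sketch (my own toy example, mirroring the snippet above) showing that this operation gives each row unit L2 norm, while the row variances end up being whatever they end up being — which is why the comment was misleading:

```python
import numpy as np

# Toy 3 x 4 "word vector" matrix (illustrative values only)
W = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.5, 0.5, 0.5, 0.5],
              [2.0, 0.0, 0.0, 0.0]])

# Same row-wise normalization as in the snippet above
d = np.sum(W ** 2, 1) ** 0.5           # L2 norm of each row
W_norm = (W.T / d).T                   # divide each row by its norm

# Each row now has unit L2 norm ...
print(np.linalg.norm(W_norm, axis=1))  # -> [1. 1. 1.]

# ... but the row variances are not 1 (the second row's is 0)
print(np.var(W_norm, axis=1))
```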
So that should mean that the dot product in the following should be equivalent to cosine similarity:
```python
# cosine similarity if input W has been normalized
dist = np.dot(W, pred_vec.T)
```
Did you come across more such issues where the similarity is outside [-1,1] since your last comment?
I'm not an expert in linear algebra, but in evaluate.py, shouldn't the predicted vector `pred_vec` also be normalized for the cosine similarity to lie in [-1, 1]? I was surprised to get values greater than 1 in some cases.
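That would explain the out-of-range values. A minimal sketch (my own illustration, not code from evaluate.py): with unit-norm rows of `W` but an unnormalized `pred_vec`, the dot product is the cosine similarity scaled by `||pred_vec||`, so it can exceed 1; normalizing `pred_vec` brings it back into [-1, 1]:

```python
import numpy as np

# Unit-norm word vectors (rows), as produced by the normalization above
W = np.array([[1.0, 0.0],
              [0.6, 0.8]])

# Hypothetical predicted vector (e.g. a + b - c in an analogy query);
# sums/differences of unit vectors generally do NOT have unit norm
pred_vec = np.array([1.5, 0.5])

dist = np.dot(W, pred_vec.T)  # scaled cosine; can fall outside [-1, 1]
print(dist)                   # the first entry is 1.5 > 1

# Dividing by ||pred_vec|| recovers a true cosine similarity in [-1, 1]
cos = np.dot(W, pred_vec / np.linalg.norm(pred_vec))
print(cos)
```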