Step size and gradient clipping for bias terms

stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings

Apache License 2.0

I added step-size and gradient-clipping processing to the updates for the bias terms of the word vectors, so that they mirror the other updates. Without this, the eta and grad-clip parameters do not function as described, and the loss function being minimized is not quite the one that appears in the original paper.

In my own experiments, this does not noticeably affect the final output of the code in most cases. It appears to matter only in certain edge cases where the original code fails to converge, such as when the co-occurrence matrix contains many entries between 0 and 1.0.

Step size and gradient clipping for bias terms #209