stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Bug with learning rate for Adagrad Optimizer? #134

Open hugo-chu opened 5 years ago

hugo-chu commented 5 years ago

Hi, I noticed that in line 140 of glove.c we do fdiff *= eta, which is then used in lines 145 and 146 to calculate temp1 and temp2.

/* Adaptive gradient updates */
fdiff *= eta; // for ease in calculating gradient
real W_updates1_sum = 0;
real W_updates2_sum = 0;
for (b = 0; b < vector_size; b++) {
   // learning rate times gradient for word vectors
   temp1 = fdiff * W[b + l2];
   temp2 = fdiff * W[b + l1];
   // adaptive updates
   W_updates1[b] = temp1 / sqrt(gradsq[b + l1]);
   W_updates2[b] = temp2 / sqrt(gradsq[b + l2]);
   W_updates1_sum += W_updates1[b];
   W_updates2_sum += W_updates2[b];
   gradsq[b + l1] += temp1 * temp1;
   gradsq[b + l2] += temp2 * temp2;
}

However, since we also use temp1 and temp2 to update gradsq[b + l1] and gradsq[b + l2], eta is squared along with the gradients, which means the following:

\frac{\eta\, g_t}{\sqrt{\sum_{\tau < t} (\eta\, g_\tau)^2}} = \frac{g_t}{\sqrt{\sum_{\tau < t} g_\tau^2}}

such that the learning rate cancels out. What would then happen is that the first update of every coordinate is taken with learning rate eta, and every update after that is effectively taken with a learning rate of 1.0.

The same applies to the training of the biases.
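To make the effect concrete, here is a minimal, self-contained sketch (not code from glove.c; the constant gradient and the print checkpoints are made up for illustration) that tracks a single coordinate under three rules: the glove.c-style update, textbook Adagrad, and Adagrad with the learning rate removed.

#include <stdio.h>
#include <math.h>

int main(void) {
    const double eta = 0.05;              /* initial learning rate (glove.c's default) */
    const double g   = 0.5;               /* toy constant gradient, made up for illustration */
    double sq_glove = 1.0, sq_raw = 1.0;  /* gradsq is initialized to 1.0 in glove.c */
    long t;

    for (t = 1; t <= 1000000; t++) {
        double scaled    = eta * g;                  /* fdiff *= eta, then temp = fdiff * W[...] */
        double upd_glove = scaled / sqrt(sq_glove);  /* what glove.c computes */
        double upd_text  = eta * g / sqrt(sq_raw);   /* textbook Adagrad */
        double upd_noeta = g / sqrt(sq_raw);         /* Adagrad with eta = 1 */
        sq_glove += scaled * scaled;                 /* accumulates eta^2 * g^2 */
        sq_raw   += g * g;                           /* accumulates g^2 */

        if (t == 1 || t == 100 || t == 10000 || t == 1000000)
            printf("t=%-8ld glove=%.6f textbook=%.6f eta-free=%.6f\n",
                   t, upd_glove, upd_text, upd_noeta);
    }
    return 0;
}

The first glove-style update is eta * g, but once eta^2 * sum(g^2) dominates the 1.0 initialization of gradsq, the glove-style update is essentially equal to the eta-free one, i.e. the learning rate no longer has any effect, whereas textbook Adagrad keeps eta as a genuine step size throughout.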

Is this intentional?

Thanks, Hugo

Jurian commented 5 years ago

It seems to me that by folding the learning rate into the gradient before squaring, the accumulated squared gradient is smaller (a smaller number is being squared, since eta < 1). The squared sum therefore grows more slowly, so the updates do not shrink as quickly as they would if the full squared gradient were accumulated.

Not exactly to Adagrad specification?

If I am wrong here, please correct me! :-)
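Writing the two denominators out makes this concrete (a sketch for a single coordinate with raw gradients g_1, g_2, ..., assuming gradsq starts at its 1.0 initialization as in glove.c):

\Delta_{\text{glove}}(t) = \frac{\eta\, g_t}{\sqrt{1 + \eta^2 \sum_{\tau < t} g_\tau^2}}
\qquad\text{vs.}\qquad
\Delta_{\text{adagrad}}(t) = \frac{\eta\, g_t}{\sqrt{1 + \sum_{\tau < t} g_\tau^2}}

With eta < 1 the left denominator grows more slowly, so those updates shrink more slowly; and once \eta^2 \sum_{\tau < t} g_\tau^2 \gg 1 the eta cancels and the step tends to g_t / \sqrt{\sum_{\tau < t} g_\tau^2}, which is the cancellation described above.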