stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Bug with learning rate for Adagrad Optimizer? #134

Open hugo-chu opened 5 years ago

hugo-chu commented 5 years ago

Hi, I noticed that in line 140 of glove.c we do fdiff *= eta, which is then used in lines 145 and 146 to calculate temp1 and temp2.

/* Adaptive gradient updates */
fdiff *= eta; // for ease in calculating gradient
real W_updates1_sum = 0;
real W_updates2_sum = 0;
for (b = 0; b < vector_size; b++) {
   // learning rate times gradient for word vectors
   temp1 = fdiff * W[b + l2];
   temp2 = fdiff * W[b + l1];
   // adaptive updates
   W_updates1[b] = temp1 / sqrt(gradsq[b + l1]);
   W_updates2[b] = temp2 / sqrt(gradsq[b + l2]);
   W_updates1_sum += W_updates1[b];
   W_updates2_sum += W_updates2[b];
   gradsq[b + l1] += temp1 * temp1;
   gradsq[b + l2] += temp2 * temp2;
}

However, since we also use temp1 and temp2 to update gradsq[b + l1] and gradsq[b + l2], eta is squared along with the gradients, which means the following:

\frac{\eta\, g_t}{\sqrt{\sum_{\tau < t} (\eta\, g_\tau)^2}} = \frac{g_t}{\sqrt{\sum_{\tau < t} g_\tau^2}}

such that the learning rate cancels out. What would then happen is that the first update of every coordinate is taken with learning rate eta, and every update after that is effectively taken with a learning rate of 1.0.

The same applies to the training of the biases.
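To make the effect concrete, here is a minimal, self-contained sketch (not code from glove.c; the constant gradient and the print checkpoints are made up for illustration) that tracks a single coordinate under three rules: the glove.c-style update, textbook Adagrad, and Adagrad with the learning rate removed.

#include <stdio.h>
#include <math.h>

int main(void) {
    const double eta = 0.05;              /* initial learning rate (glove.c's default) */
    const double g   = 0.5;               /* toy constant gradient, made up for illustration */
    double sq_glove = 1.0, sq_raw = 1.0;  /* gradsq is initialized to 1.0 in glove.c */
    long t;

    for (t = 1; t <= 1000000; t++) {
        double scaled    = eta * g;                  /* fdiff *= eta, then temp = fdiff * W[...] */
        double upd_glove = scaled / sqrt(sq_glove);  /* what glove.c computes */
        double upd_text  = eta * g / sqrt(sq_raw);   /* textbook Adagrad */
        double upd_noeta = g / sqrt(sq_raw);         /* Adagrad with eta = 1 */
        sq_glove += scaled * scaled;                 /* accumulates eta^2 * g^2 */
        sq_raw   += g * g;                           /* accumulates g^2 */

        if (t == 1 || t == 100 || t == 10000 || t == 1000000)
            printf("t=%-8ld glove=%.6f textbook=%.6f eta-free=%.6f\n",
                   t, upd_glove, upd_text, upd_noeta);
    }
    return 0;
}

The first glove-style update is eta * g, but once eta^2 * sum(g^2) dominates the 1.0 initialization of gradsq, the glove-style update is essentially equal to the eta-free one, i.e. the learning rate no longer has any effect, whereas textbook Adagrad keeps eta as a genuine step size throughout.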

Is this intentional?

Thanks, Hugo

Jurian commented 5 years ago

It seems to me that by folding the learning rate into the gradient before squaring, the accumulated squared gradient is smaller (a smaller number is being squared, since eta < 1). The squared sum therefore grows more slowly, so the updates do not shrink as quickly as they would if the full squared gradient were accumulated.

Not exactly to Adagrad specification?

If I am wrong here, please correct me! :-)
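Writing the two denominators out makes this concrete (a sketch for a single coordinate with raw gradients g_1, g_2, ..., assuming gradsq starts at its 1.0 initialization as in glove.c):

\Delta_{\text{glove}}(t) = \frac{\eta\, g_t}{\sqrt{1 + \eta^2 \sum_{\tau < t} g_\tau^2}}
\qquad\text{vs.}\qquad
\Delta_{\text{adagrad}}(t) = \frac{\eta\, g_t}{\sqrt{1 + \sum_{\tau < t} g_\tau^2}}

With eta < 1 the left denominator grows more slowly, so those updates shrink more slowly; and once \eta^2 \sum_{\tau < t} g_\tau^2 \gg 1 the eta cancels and the step tends to g_t / \sqrt{\sum_{\tau < t} g_\tau^2}, which is the cancellation described above.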