Open jozef-mokry opened 8 years ago
This probably does not make much difference, but I noticed that the read gates r1 and r2 in the gru_cond_layer method are used slightly differently:
Here (https://github.com/nyu-dl/dl4mt-tutorial/blob/master/session3/nmt.py#L448) the hidden state is computed as:
h1 = tanh(xx_ + r1*(Ux*h)), where xx_ is Wx*state_below + bx. [Notice that the read gate r1 is not applied to the bias bx.]
However, the second hidden state h2 at (https://github.com/nyu-dl/dl4mt-tutorial/blob/master/session3/nmt.py#L477) is computed as:
h2 = tanh(Wcx*ctx_ + r2*(Ux_nl*h1 + bx_nl)) [Notice that the read gate r2 is applied to the bias bx_nl.] If r2 "kills" some dimensions of the bias term bx_nl, then the corresponding decision hyperplanes of Wcx are forced to pass through the origin.
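For concreteness, here is a minimal numpy sketch of the two computations above (the shapes and values are illustrative, not the ones used in nmt.py; the only point is where the gate multiplies the bias):

import numpy as np

# Minimal sketch: only the placement of the gates relative to the biases matters.
dim = 4
rng = np.random.RandomState(0)

h = rng.randn(dim)            # previous decoder hidden state
state_below = rng.randn(dim)  # embedded target word
ctx_ = rng.randn(dim)         # attention context vector
Wx, Ux, Wcx, Ux_nl = (rng.randn(dim, dim) for _ in range(4))
bx, bx_nl = rng.randn(dim), rng.randn(dim)
r1, r2 = rng.uniform(size=dim), rng.uniform(size=dim)  # read gates in (0, 1)

# First hidden state: the bias bx sits outside the gate, so r1 never scales it.
xx_ = state_below.dot(Wx) + bx
h1 = np.tanh(xx_ + r1 * h.dot(Ux))

# Second hidden state: the bias bx_nl sits inside the gate, so r2 scales it too.
h2 = np.tanh(ctx_.dot(Wcx) + r2 * (h1.dot(Ux_nl) + bx_nl))

In the second line, any dimension where r2 is close to zero effectively removes bx_nl, leaving only the (Wcx*ctx_) term in that dimension, which is the hyperplane-through-the-origin effect described above.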
Is this asymmetry intended?