Open jozef-mokry opened 8 years ago
This probably does not make much difference, but I noticed that the read gates r1 and r2 in the gru_cond_layer method are used slightly differently:
Here (https://github.com/nyu-dl/dl4mt-tutorial/blob/master/session3/nmt.py#L448) the hidden state is computed as:
h1 = tanh(xx_ + r1*(Ux*h)), where xx_ is Wx*state_below + bx. [Notice that the read gate r1 is not applied to the bias bx.]
However, the second hidden state h2 at (https://github.com/nyu-dl/dl4mt-tutorial/blob/master/session3/nmt.py#L477) is computed as:
h2 = tanh(Wcx*ctx_ + r2*(Ux_nl*h1 + bx_nl)) [Notice that the read gate r2 is applied to the bias bx_nl.] If r2 "kills" some dimensions of the bias term bx_nl, then the corresponding decision hyperplanes of Wcx are forced to pass through the origin.
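For concreteness, here is a minimal numpy sketch of the two computations above (the shapes and values are illustrative, not the ones used in nmt.py; the only point is where the gate multiplies the bias):

import numpy as np

# Minimal sketch: only the placement of the gates relative to the biases matters.
dim = 4
rng = np.random.RandomState(0)

h = rng.randn(dim)            # previous decoder hidden state
state_below = rng.randn(dim)  # embedded target word
ctx_ = rng.randn(dim)         # attention context vector
Wx, Ux, Wcx, Ux_nl = (rng.randn(dim, dim) for _ in range(4))
bx, bx_nl = rng.randn(dim), rng.randn(dim)
r1, r2 = rng.uniform(size=dim), rng.uniform(size=dim)  # read gates in (0, 1)

# First hidden state: the bias bx sits outside the gate, so r1 never scales it.
xx_ = state_below.dot(Wx) + bx
h1 = np.tanh(xx_ + r1 * h.dot(Ux))

# Second hidden state: the bias bx_nl sits inside the gate, so r2 scales it too.
h2 = np.tanh(ctx_.dot(Wcx) + r2 * (h1.dot(Ux_nl) + bx_nl))

In the second line, any dimension where r2 is close to zero effectively removes bx_nl, leaving only the (Wcx*ctx_) term in that dimension, which is the hyperplane-through-the-origin effect described above.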
Is this asymmetry intended?