Closed — Justin-Tan closed this issue 7 years ago.
Hi Justin, Thanks for having such a close look.
The function linear() takes a list of tensors (among other arguments) as input. These tensors are stacked, and a single weight matrix is allocated to compute the matrix-vector product against the stacked inputs. To save computation time, both W*x[t] and R*y[t-1] are computed simultaneously as out = [W, R] * [x[t]; y[t-1]]. In this case, the matrix allocated in linear() is the concatenation [W, R], and the inputs are stacked to make this construction work. Note that x[t] and y[t-1] are therefore not processed by the same weights (as far as I can see). Thanks again for your feedback.
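As a sanity check of the identity above, here is a minimal NumPy sketch (variable names and dimensions are hypothetical, not taken from the repo) showing that one fused product with the concatenated matrix [W, R] equals the two separate products W*x[t] + R*y[t-1]:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 3, 4                          # hypothetical sizes
x = rng.standard_normal(n_in)                  # input x[t]
y = rng.standard_normal(n_hidden)              # previous output y[t-1]
W = rng.standard_normal((n_hidden, n_in))      # input weight matrix
R = rng.standard_normal((n_hidden, n_hidden))  # recurrent weight matrix

# Separate products, as written in the paper: W x[t] + R y[t-1]
separate = W @ x + R @ y

# Fused product: concatenate the weights column-wise, stack the inputs
fused = np.concatenate([W, R], axis=1) @ np.concatenate([x, y])

print(np.allclose(separate, fused))  # the two computations agree
```

Because [W, R] is a block matrix, each input still meets only its own weight block, so the fusion is purely a performance optimization.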
Please let me know if there is anything else that should be made clearer in the code.
I was looking at the TF implementation, and it seems that you use the same weights for the input x^{[t]} and the hidden state s^{[t]} in the function linear(...), but in the paper (Eqs. 7, 8, 9) they are labelled as distinct matrices, W and R?
Let me know if I'm missing something.
Great paper, by the way, early results seem to be competitive with deep bidirectional GRUs for sequence classification.