Closed: helson73 closed this issue 7 years ago
I think you have it the other way around: every batch has the same source length, but potentially different target lengths. We set the weight of the blank symbol in the criterion to zero, so we do not receive any gradients on the target side if target_output[t] = blank symbol (which has index 1).
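For reference, here is a minimal PyTorch sketch of that masking idea (the repo itself is Torch/Lua and uses ClassNLLCriterion; the vocabulary size and blank index below are made-up for illustration). A per-class weight vector with a zero at the blank index means padded target positions contribute nothing to the loss, so no gradient flows back from them:

```python
import torch
import torch.nn as nn

vocab_size = 10   # hypothetical target vocabulary size
blank_idx = 1     # blank/padding symbol (illustrative index)

# Per-class weights: 1 everywhere except the blank symbol, which gets 0.
weights = torch.ones(vocab_size)
weights[blank_idx] = 0.0
criterion = nn.NLLLoss(weight=weights)  # analogous to Torch's ClassNLLCriterion with weights

logits = torch.randn(4, vocab_size, requires_grad=True)   # fake decoder outputs, 4 timesteps
log_probs = torch.log_softmax(logits, dim=-1)
targets = torch.tensor([3, blank_idx, 5, blank_idx])       # blank positions are padding

loss = criterion(log_probs, targets)
loss.backward()

# Gradients at the blank target positions are exactly zero:
print(logits.grad[1].abs().sum(), logits.grad[3].abs().sum())
```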
For the bi-LSTM, we add the forward and backward states because it's simple. Alternatively, you could concatenate them, but that requires some fiddling around with the RNN sizes.
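To illustrate the two options, here is a small PyTorch sketch (again, the repo is Torch/Lua; the hidden and embedding sizes are arbitrary). Summing keeps the context at the original hidden size, whereas concatenating doubles it, which is what forces the decoder and attention dimensions to be adjusted:

```python
import torch
import torch.nn as nn

hidden, emb = 256, 128   # hypothetical encoder hidden and embedding sizes

enc = nn.LSTM(input_size=emb, hidden_size=hidden, bidirectional=True)
src = torch.randn(20, 32, emb)   # (src_len, batch, emb), made-up sizes
out, _ = enc(src)                # (src_len, batch, 2 * hidden)

fwd, bwd = out[..., :hidden], out[..., hidden:]

# Option 1 (what the repo does): add the two directions; the context stays
# `hidden`-dimensional, so decoder and attention sizes need no changes.
context_sum = fwd + bwd                      # (src_len, batch, hidden)

# Option 2: concatenate; the context becomes 2 * hidden, so the decoder /
# attention layers must be resized accordingly.
context_cat = torch.cat([fwd, bwd], dim=-1)  # (src_len, batch, 2 * hidden)
```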
@yoonkim Thanks! I found it.
The target side is sorted, so every target sentence in a batch has the same length. But on the source side, sentence lengths vary, and there seems to be no scheme to block the "blank" source words. Even if the blank embedding is set to zero, the outputs at blank positions will still take on some value because of the recurrence in the LSTM. Why is this issue ignored? P.S. When a bi-LSTM is used, the backward LSTM's context is just added to the forward one; is there any special reason for this?
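To make the concern concrete, here is a tiny PyTorch sketch (the repo itself is Torch/Lua; all sizes are arbitrary) showing that an LSTM still emits nonzero states at positions whose input embedding is all zeros, because of its biases and the hidden state carried over from earlier timesteps:

```python
import torch
import torch.nn as nn

emb_dim, hidden = 8, 16   # toy sizes for illustration

lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden)

# A "sentence" whose last two positions are blanks with all-zero embeddings.
real = torch.randn(3, 1, emb_dim)
blanks = torch.zeros(2, 1, emb_dim)
src = torch.cat([real, blanks], dim=0)   # (5, 1, emb_dim)

out, _ = lstm(src)
print(out[3:].abs().sum())  # > 0: blank positions still produce nonzero states
                            # via the LSTM biases and the carried-over hidden state
```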