@nilavghosh I have the same question: "Are these implementations fundamentally the same?"
Hi, yes, this is a variant of the original approach, but it is not an uncommon one either. I'm not sure about the performance difference, however.
The main reason I picked this approach is that it makes more sense to me to have the attention outputs closer to the output layer (as opposed to feeding them back in at the decoder inputs), since the output layer is what makes the final decision.
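For concreteness, here is a minimal sketch of that arrangement using stock Keras layers rather than the repo's AttentionLayer (vocabulary and hidden sizes are just placeholders): the attention context only joins the pipeline right before the softmax output layer, instead of being fed back into the decoder.

```python
# Illustrative sketch only, not the repo's exact code.
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, hidden = 8000, 6000, 256  # placeholder sizes

enc_in = layers.Input(shape=(None,), name="encoder_tokens")
dec_in = layers.Input(shape=(None,), name="decoder_tokens")

enc_emb = layers.Embedding(src_vocab, hidden)(enc_in)
enc_out, enc_state = layers.GRU(hidden, return_sequences=True, return_state=True)(enc_emb)

dec_emb = layers.Embedding(tgt_vocab, hidden)(dec_in)
dec_out = layers.GRU(hidden, return_sequences=True)(dec_emb, initial_state=enc_state)

# Additive (Bahdanau-style) attention: query = decoder states, value = encoder states.
context = layers.AdditiveAttention()([dec_out, enc_out])

# The attention output is used right before the output layer,
# not concatenated with the decoder inputs.
concat = layers.Concatenate(axis=-1)([dec_out, context])
probs = layers.TimeDistributed(layers.Dense(tgt_vocab, activation="softmax"))(concat)

model = Model([enc_in, dec_in], probs)
```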
https://github.com/thushv89/attention_keras/blob/f7c6f40cb207431d0229c38992eb93ad17d38e20/examples/nmt/model.py#L30
Is the implementation here a variation of the Bahdanau attention paper? As per the paper, during training the context vector (computed from the alignment weights) is concatenated with the embedded target of the previous timestep, and that concatenated vector is supplied as input to the decoder.
https://github.com/thushv89/attention_keras/blob/f7c6f40cb207431d0229c38992eb93ad17d38e20/examples/nmt/model.py#L35 In the code base here, the concatenated vector is instead passed directly through a softmax layer to get the predicted output.
Are these implementations fundamentally the same?
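For reference, a single decoding step in the paper-style arrangement might look roughly like the sketch below. It uses only stock Keras layers; the layer names, sizes, and the decode_step helper are illustrative assumptions, not code from the paper or from this repo.

```python
# Illustrative sketch of one Bahdanau-style decoding step:
# the context vector is concatenated with the embedding of the
# previous target token and that concatenation drives the GRU cell.
import tensorflow as tf
from tensorflow.keras import layers

tgt_vocab, hidden = 6000, 256  # placeholder sizes

gru_cell = layers.GRUCell(hidden)
attention = layers.AdditiveAttention()
embedding = layers.Embedding(tgt_vocab, hidden)
output_layer = layers.Dense(tgt_vocab, activation="softmax")

def decode_step(prev_token, prev_state, enc_out):
    """One decoding step. prev_token: (batch,), prev_state: (batch, hidden),
    enc_out: (batch, src_len, hidden)."""
    query = tf.expand_dims(prev_state, 1)              # (batch, 1, hidden)
    context = attention([query, enc_out])              # (batch, 1, hidden)
    context = tf.squeeze(context, axis=1)              # (batch, hidden)
    prev_emb = embedding(prev_token)                   # (batch, hidden)
    cell_in = tf.concat([prev_emb, context], axis=-1)  # decoder input = [y_{t-1}; c_t]
    dec_out, new_states = gru_cell(cell_in, [prev_state])
    probs = output_layer(dec_out)                      # softmax over target vocab
    return probs, new_states[0]
```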