Closed kryptonite0 closed 3 years ago
Hi, you're right, there's something fishy about this forward definition.
I wrote the architecture using the original paper, as well as a few repos that already implemented the Transformer in pytorch and keras. I also used The Annotated Transformer that you linked, and the Illustrated Transformer. Today, there is a Transformer model directly implemented in PyTorch.
Hi Max, thanks a lot for your answer. What I'm still confused about is that if you use the same instance of a layer twice in the forward method, you are effectively applying the same weights twice, in an RNN fashion. Is this correct and intentional?
Yes, yes, I was so focused on searching for other variables inside the MHA class that could be saved between both attention inferences that I forgot weights were a thing. Fixed in 240f8fbc2c20fac4182239fa70a6181c7120bd3c, thanks a lot!
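To make the fix concrete, here is a minimal sketch of a decoder block that keeps self-attention and encoder-decoder attention as two separate module instances, so each gets its own weights. This is not the repo's exact code: it uses PyTorch's built-in `nn.MultiheadAttention` instead of the repo's MHA class, and the class and attribute names mirror the ones discussed above for illustration only.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Sketch of a Transformer decoder block.

    The key point from this issue: self-attention and encoder-decoder
    attention must be two *distinct* module instances. Reusing one
    instance for both calls would silently share weights between them.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Two separate attention modules -> two separate sets of weights.
        self._selfAttention = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True
        )
        self._encoderDecoderAttention = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True
        )
        self._norm1 = nn.LayerNorm(d_model)
        self._norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Self-attention over the decoder input.
        attn, _ = self._selfAttention(x, x, x)
        x = self._norm1(x + attn)
        # Encoder-decoder attention: queries come from the decoder,
        # keys/values from the encoder output ("memory").
        attn, _ = self._encoderDecoderAttention(x, memory, memory)
        return self._norm2(x + attn)
```

Because the two attributes hold different `nn.MultiheadAttention` objects, their projection weights are independent tensors and are trained independently, which is what the standard architecture expects.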
Thanks a lot to you :))
Hi Max, thank you for sharing this amazing repo. I have a question regarding the architecture. In the `Decoder` module (https://github.com/maxjcohen/transformer/blob/master/tst/decoder.py) you define a `self._encoderDecoderAttention` (L62) that you do not seem to use in the `forward` method. Instead, in the `forward` method both the self-attention and the encoder-decoder attention use the same layer, `self._selfAttention`. Is this a bug maybe? I compared the architecture with https://nlp.seas.harvard.edu/2018/04/03/attention.html and I would guess that the encoder-decoder attention should be done using `self._encoderDecoderAttention`. What do you think? Can I also please ask you to tell me if you wrote the architecture yourself from scratch? Thanks in advance! :)