Closed kryptonite0 closed 3 years ago
Hi, you're right, there's something fishy about this forward definition.
I wrote the architecture using the original paper, as well as a few repos that already implemented the Transformer in pytorch and keras. I also used The Annotated Transformer that you linked, and the Illustrated Transformer. Today, there is a Transformer model directly implemented in PyTorch.
Hi Max, thanks a lot for your answer. What I'm still confused about is that if you use the same instance of a layer twice in the forward method, you are effectively applying the same weights twice, in an RNN fashion. Is this correct and intentional?
Yes, yes, I was so focused on searching for other variables inside the MHA class that could be saved between both attention inferences that I forgot weights were a thing. Fixed in 240f8fbc2c20fac4182239fa70a6181c7120bd3c, thanks a lot!
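To make the fix concrete, here is a minimal sketch of a decoder block that keeps self-attention and encoder-decoder attention as two separate module instances, so each gets its own weights. This is not the repo's exact code: it uses PyTorch's built-in `nn.MultiheadAttention` instead of the repo's MHA class, and the class and attribute names mirror the ones discussed above for illustration only.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Sketch of a Transformer decoder block.

    The key point from this issue: self-attention and encoder-decoder
    attention must be two *distinct* module instances. Reusing one
    instance for both calls would silently share weights between them.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Two separate attention modules -> two separate sets of weights.
        self._selfAttention = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True
        )
        self._encoderDecoderAttention = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True
        )
        self._norm1 = nn.LayerNorm(d_model)
        self._norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Self-attention over the decoder input.
        attn, _ = self._selfAttention(x, x, x)
        x = self._norm1(x + attn)
        # Encoder-decoder attention: queries come from the decoder,
        # keys/values from the encoder output ("memory").
        attn, _ = self._encoderDecoderAttention(x, memory, memory)
        return self._norm2(x + attn)
```

Because the two attributes hold different `nn.MultiheadAttention` objects, their projection weights are independent tensors and are trained independently, which is what the standard architecture expects.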
Thanks a lot to you :))
Hi Max, thank you for sharing this amazing repo. I have a question regarding the architecture. In the `Decoder` module (https://github.com/maxjcohen/transformer/blob/master/tst/decoder.py) you define a `self._encoderDecoderAttention` (L62) that you do not seem to use in the `forward` method. Instead, in the `forward` method both the self-attention and the encoder-decoder attention use the same layer, `self._selfAttention`. Is this a bug maybe? I compared the architecture with https://nlp.seas.harvard.edu/2018/04/03/attention.html and I would guess that the encoder-decoder attention should be done using `self._encoderDecoderAttention`. What do you think? Can I also please ask you to tell me if you wrote the architecture yourself from scratch? Thanks in advance! :)