samiraabnar / attention_flow


[QUESTION] Attention Flow in Decoder models #6

Open abarbosa94 opened 3 years ago

abarbosa94 commented 3 years ago

Hi all,

This is not an issue per se, but rather a question about a specific point in your paper on which I would like a bit more clarification.

First and foremost, I enjoyed reading your paper :) I liked it so much that it has become one of the leading resources for my master's degree research project.

I'm trying to implement Attention Flow in a Decoder setting (T5, to be more specific). Quoting the following paragraph from the paper:

Since in Transformer decoder, future tokens are masked, naturally there is more attention toward initial tokens in the input sequence, and both attention rollout and attention flow will be biased toward these tokens. Hence, to apply these methods on a Transformer decoder, we should first normalize based on the receptive field of attention

What do you mean by receptive field of attention here? Is it the 1D-convolution size based on the image below?
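For reference, here is the plain attention-rollout computation I am starting from, before applying any receptive-field normalization. This is a NumPy sketch: averaging heads, adding the identity for residual connections, and re-normalizing follow the paper's rollout recipe, while the function name and everything else is my own scaffolding.

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout: multiply layer-wise attention maps from the
    bottom layer up, averaging over heads and mixing in the identity
    matrix to account for the residual connections.

    attentions: list of (num_heads, seq_len, seq_len) arrays, one per layer,
                each row already summing to 1 (post-softmax, mask applied).
    Returns a (seq_len, seq_len) matrix of token-to-token attributions.
    """
    rollout = None
    for layer_attn in attentions:
        # Average attention over heads for this layer.
        a = layer_attn.mean(axis=0)
        # Add the residual connection and re-normalize rows to sum to 1
        # (equivalent to normalizing A + I, since rows of A sum to 1).
        a = 0.5 * a + 0.5 * np.eye(a.shape[-1])
        # Compose with the rollout accumulated from the layers below.
        rollout = a if rollout is None else a @ rollout
    return rollout
```

In a decoder, every layer's map is lower-triangular because of the causal mask, so the rollout stays lower-triangular as well, and mass accumulates on the early tokens, which I take to be the bias the quoted paragraph refers to.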

image

By the way, would it be a problem if I relied on cross-attention to compute attention flow in the decoder case? I'm asking because, if I'm not wrong, I would only have access to the decoder tokens predicted up to each time step. For instance, at timestep $t_0$ the self-attention would trivially be 1, because only a single token would exist. If I used cross-attention instead, I could get information from both the encoder and the decoder (note that I'm using the transformers library).
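Concretely, what I have in mind is composing the decoder-side rollout with a head-averaged cross-attention map to push attributions back onto encoder tokens. This is purely my own guess at how to combine the two, not something from the paper; the function and its name are hypothetical:

```python
import numpy as np

def map_to_encoder_tokens(dec_rollout, cross_attn):
    """Compose a decoder self-attention rollout with cross-attention to
    attribute each generated token to the encoder (input) tokens.

    dec_rollout: (tgt_len, tgt_len) row-stochastic rollout over decoder tokens.
    cross_attn:  (num_heads, tgt_len, src_len) cross-attention from one layer.
    Returns a (tgt_len, src_len) row-stochastic attribution matrix.
    """
    # Average cross-attention over heads and re-normalize the rows.
    c = cross_attn.mean(axis=0)
    c = c / c.sum(axis=-1, keepdims=True)
    # Route decoder-token relevance through cross-attention to the encoder.
    return dec_rollout @ c
```

Since both factors are row-stochastic, the result is too, so each generated token gets a distribution over input tokens. Does composing things this way sound reasonable to you, or does it sidestep the receptive-field issue rather than fixing it?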

Thank you very much in advance!

Best,