Closed: ashu5644 closed this issue 4 years ago.
Note that the SelfAttentionLayer (like many other layers, e.g. RnnCellLayer) can run in two modes: step by step inside a recurrent loop, where it keeps a key/value memory/state covering all predecessor positions, or on the whole sequence at once with attention_left_only, during training. In the latter case, it has no memory/state.

@kazuki-irie I think the figure you made, which was in your slides for the Transformer LM, is helpful for understanding this. Maybe you can add the slides to the i6 publications, to the existing paper? (I think this is possible somehow...)
I have uploaded the requested slides now.
They can be found on our publication website https://www-i6.informatik.rwth-aachen.de/web/Publications/index.html by searching for "Language Modeling with Deep Transformers"; the link to the slides ([slides]) is appended at the end of the entry. Direct link.
The figure @albertz is referring to is on page 4. This figure was not meant to be an illustration of the code, but it should at least show what the KV-memory/state in a self-attention layer is.
Thanks for the clarification and the slides. So when using the trained Transformer LM during fusion with the end-to-end model, with the default behaviour of the current fusion script (https://github.com/rwth-i6/returnn-experiments/blob/master/2019-lm-transformers/librispeech/bpe_10k/base3.retrain2.transfo_lm.fusion_eval4.gamma1.1.config), will the LM attend only to the current token, and not to the previously decoded symbols during fusion, i.e. will self-attention give no weight to previous symbols? Am I right? How can I specify in the fusion config that it should attend to all previously decoded tokens?
No, that is not correct. It does attend to all predecessor token positions and the current one.
Please have a closer look at the figure (or check how self-attention is defined; this is independent of RETURNN): attention is computed using the query vector q_n^{(\ell)} for the current input, together with the state h_{n-1}^{(\ell)} and the current key k_n^{(\ell)} and value v_n^{(\ell)} vectors. Here h_{n-1}^{(\ell)} contains the key and value vectors for all predecessor token positions (i.e. up to position n-1).
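To make this concrete, here is a minimal NumPy sketch (toy dimensions, a single head, plain scaled dot-product attention; none of the names below come from the RETURNN code) showing that per-step attention over an accumulated key/value memory gives the same result as full-sequence attention with a causal mask:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4  # toy sequence length and model dimension
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Mode 1: whole sequence at once, with a causal mask
# (each position may attend to itself and all predecessors only).
scores = q @ k.T / np.sqrt(d)                      # (n, n)
future = np.triu(np.ones((n, n), dtype=bool), k=1)  # mask future positions
out_masked = softmax(np.where(future, -np.inf, scores)) @ v  # (n, d)

# Mode 2: step by step, keeping an accumulated key/value memory
# (the h_{n-1} state: all keys/values up to the current position).
ks, vs, out_steps = [], [], []
for t in range(n):
    ks.append(k[t])
    vs.append(v[t])
    K, V = np.stack(ks), np.stack(vs)              # (t+1, d)
    w = softmax(q[t] @ K.T / np.sqrt(d))           # weights over 0..t
    out_steps.append(w @ V)
out_steps = np.stack(out_steps)

# Both modes give identical outputs.
assert np.allclose(out_masked, out_steps)
```

This is why the LM does attend to all previously decoded tokens during step-by-step decoding: the accumulated keys and values play the role of the masked full-sequence attention at training time.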
Oh, ok. I was confused by the attention term of the Transformer that you mentioned for the decoding procedure. So it will attend to all previous inputs, but the input at a particular step itself comes only from the current token, not from a concatenation of the previous inputs.
I thought the memory was of the Transformer-XL type: since the self-attention code already has relative positional encoding, it would need an XL-style memory implementation and some data-pipeline changes for the Transformer-XL architecture.
Is this clear now? Any suggestions on how to improve the documentation? Otherwise I will close this now.
yes, it is clear now.
I was looking at the SelfAttentionLayer code and going through how it is implemented. I understood almost everything, including the relative positional encoding, but the memory part is not clear to me. I am confused about the type of memory being reused. Does "memory" refer to the layer-L outputs of a particular sequence in one data batch being reused, for the corresponding sequence in the next data batch, as input to layer L+1, or is it some other kind of memory? How should it be specified for use in a RETURNN config?