Closed: ashu5644 closed this issue 4 years ago.
Note that the SelfAttentionLayer (like many other layers, e.g. RnnCellLayer) can run in two modes: step by step inside a recurrent loop, where it keeps a key/value memory/state covering all predecessor positions, or on the whole sequence at once with attention_left_only, during training. In the latter case, it has no memory/state.

@kazuki-irie I think the figure you made, which was in your slides for the Transformer LM, is helpful for understanding this. Maybe you can add the slides to the i6 publications, to the existing paper? (I think this is possible somehow...)
I have uploaded the requested slides now.
They can be found on our publication website https://www-i6.informatik.rwth-aachen.de/web/Publications/index.html by searching for "Language Modeling with Deep Transformers"; the link to the slides ([slides]) is appended at the end of the entry. Direct link.
The figure @albertz is referring to is on page 4. This figure was not meant to be an illustration of the code, but it should at least show what the KV-memory/state in a self-attention layer is.
Thanks for the clarification and the slides. So when using the trained Transformer LM during fusion with the end-to-end model, with the default behaviour of the current fusion script (https://github.com/rwth-i6/returnn-experiments/blob/master/2019-lm-transformers/librispeech/bpe_10k/base3.retrain2.transfo_lm.fusion_eval4.gamma1.1.config), will the LM attend only to the current token, and not to the previously decoded symbols during fusion, i.e. will self-attention give no weight to previous symbols? Am I right? How can I specify in the fusion config that it should attend to all previously decoded tokens?
No, that is not correct. It does attend to all predecessor token positions and the current one.
Please have a closer look at the figure (or check how self-attention is defined; this is independent of RETURNN): attention is computed using the query vector q_n^{(\ell)} for the current input, together with the state h_{n-1}^{(\ell)} and the current key k_n^{(\ell)} and value v_n^{(\ell)} vectors. Here h_{n-1}^{(\ell)} contains the key and value vectors for all predecessor token positions (i.e. up to position n-1).
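To make this concrete, here is a minimal NumPy sketch (toy dimensions, a single head, plain scaled dot-product attention; none of the names below come from the RETURNN code) showing that per-step attention over an accumulated key/value memory gives the same result as full-sequence attention with a causal mask:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4  # toy sequence length and model dimension
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Mode 1: whole sequence at once, with a causal mask
# (each position may attend to itself and all predecessors only).
scores = q @ k.T / np.sqrt(d)                      # (n, n)
future = np.triu(np.ones((n, n), dtype=bool), k=1)  # mask future positions
out_masked = softmax(np.where(future, -np.inf, scores)) @ v  # (n, d)

# Mode 2: step by step, keeping an accumulated key/value memory
# (the h_{n-1} state: all keys/values up to the current position).
ks, vs, out_steps = [], [], []
for t in range(n):
    ks.append(k[t])
    vs.append(v[t])
    K, V = np.stack(ks), np.stack(vs)              # (t+1, d)
    w = softmax(q[t] @ K.T / np.sqrt(d))           # weights over 0..t
    out_steps.append(w @ V)
out_steps = np.stack(out_steps)

# Both modes give identical outputs.
assert np.allclose(out_masked, out_steps)
```

This is why the LM does attend to all previously decoded tokens during step-by-step decoding: the accumulated keys and values play the role of the masked full-sequence attention at training time.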
Oh, ok. I was confused by the attention term of the Transformer that you mentioned for the decoding procedure. So it will attend to all previous inputs, but the input at a particular step itself comes only from the current token, not from a concatenation of the previous inputs.
I thought the memory was of the Transformer-XL type: since the self-attention code already has relative positional encoding, it would need an XL-style memory implementation and some data-pipeline changes for the Transformer-XL architecture.
Is this clear now? Any suggestions on how to improve the documentation? Otherwise I will close this now.
yes, it is clear now.
I was looking at the SelfAttentionLayer code and going through how it is implemented. I understood almost everything, including the relative positional encoding, but the memory part is not clear to me. I am confused about the type of memory being reused. Does "memory" refer to the layer-L outputs of a particular sequence in one data batch being reused, for the corresponding sequence in the next data batch, as input to layer L+1, or is it some other kind of memory? How should it be specified for use in a RETURNN config?