imamnurby opened this issue 1 year ago
`tgt_mask` refers to the attention mask matrix of the target translation, denoted as A. This `tgt_mask` is added to the target attention scores. Thus, A_ij = -inf means the i-th token does not attend to the j-th token.
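For concreteness, here is a minimal sketch (not this repository's code; the target length is an assumption for illustration) of how such an additive -inf mask is typically built in PyTorch:

```python
import torch

# Hypothetical target length, just for illustration.
tgt_len = 5

# Additive causal mask: positions above the diagonal are -inf,
# so token i cannot attend to any future token j > i.
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

# Equivalent helper in recent PyTorch versions:
# tgt_mask = torch.nn.Transformer.generate_square_subsequent_mask(tgt_len)

print(tgt_mask)
# tgt_mask[i, j] == -inf -> the i-th target token does not attend to the j-th token
# tgt_mask[i, j] == 0.0  -> attention is allowed
```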
`memory_key_padding_mask` controls how tokens in the decoder can attend to tokens in the encoder output. 0 indicates that a decoder token can attend to that encoder token, 1 indicates that it cannot.
Yes, we allow the decoder tokens to attend to the node tokens in the encoder output.
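Below is a hedged sketch of how a boolean `memory_key_padding_mask` following this 0/1 convention could be derived from padded source ids and passed to `nn.TransformerDecoder`. The shapes, `pad_id`, and layer sizes are illustrative assumptions, not values from this repository:

```python
import torch
import torch.nn as nn

# Hypothetical shapes / ids, purely for illustration.
batch_size, src_len, tgt_len, d_model, pad_id = 2, 7, 5, 16, 0

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

# Encoder output (memory) and target embeddings: (seq_len, batch, d_model).
memory = torch.randn(src_len, batch_size, d_model)
tgt = torch.randn(tgt_len, batch_size, d_model)

# Fake source token ids to derive the padding mask from.
src_ids = torch.randint(1, 100, (batch_size, src_len))
src_ids[:, -2:] = pad_id  # pretend the last two source positions are padding

# memory_key_padding_mask: (batch, src_len). True/1 marks padding positions,
# i.e. encoder tokens that decoder tokens must NOT attend to; 0/False means
# the position can be attended, matching the convention described above.
memory_key_padding_mask = src_ids.eq(pad_id)

# Additive causal mask for the target, as in the previous sketch.
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

out = decoder(
    tgt,
    memory,
    tgt_mask=tgt_mask,
    memory_key_padding_mask=memory_key_padding_mask,
)
print(out.shape)  # torch.Size([5, 2, 16])
```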
Dear authors,
In the following line, you compute the output of the decoder by supplying 4 params. Please clarify if my understanding is correct:
- `tgt_mask` refers to the mask of the target translation. 1 indicates a non-padding token, 0 otherwise.
- `memory_key_padding_mask` controls how tokens in the decoder can attend to tokens in the encoder. 1 indicates that a decoder token can attend to an encoder token, 0 otherwise.

Regarding the `memory_key_padding_mask`, is it intended that you allow the decoder tokens to attend to the node tokens in the encoder output?

Here is the link for the snippet above.
Edit: fix some typos