microsoft / CodeBERT


Attention mask in the decoder when using GraphCodeBERT in a translation task #219

Open imamnurby opened 1 year ago

imamnurby commented 1 year ago

Dear authors,

In the following line,

out = self.decoder(tgt_embeddings,encoder_output,tgt_mask=attn_mask,memory_key_padding_mask=(1-source_mask).bool())

you compute the decoder output by supplying four parameters. Please clarify whether my understanding is correct:

Regarding the memory_key_padding_mask, is it intended that you allow the decoder tokens to attend to the node tokens in the encoder output?

source_tokens =[tokenizer.cls_token]+code_tokens+[tokenizer.sep_token]
....
# here, you add the node tokens to the source_tokens variable
source_tokens += [x[0] for x in dfg_before]
source_tokens += [x[0] for x in dfg_after]
...
source_mask = [1] * (len(source_tokens))

Here is the link for the snippet above.
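To make the question concrete, here is a minimal sketch of that reading (hypothetical lengths and tensors, not the repository's exact code): because the node tokens are appended to source_tokens and every position of source_mask is 1, (1 - source_mask) is 0 at the node positions as well, so none of them would be masked out of the decoder's cross-attention.

import torch

code_len, node_len = 5, 3                               # hypothetical lengths
source_mask = torch.ones(1, code_len + node_len, dtype=torch.long)

memory_key_padding_mask = (1 - source_mask).bool()
print(memory_key_padding_mask)
# tensor([[False, False, False, False, False, False, False, False]])
# False = "do not mask", so the node positions are attendable too.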

Edit: fix some typos

guoday commented 1 year ago

tgt_mask is the attention mask matrix for the target translation, denoted A. It is added to the target attention scores, so A_ij = -inf means the i-th token does not attend to the j-th token.
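A generic illustration of such an additive mask (a toy example; the exact way attn_mask is built in the repository may differ): positions set to -inf receive zero weight after the softmax, so token i cannot attend to token j wherever A_ij = -inf.

import torch
import torch.nn.functional as F

size = 4
# Causal mask: -inf above the diagonal, 0 on and below it.
A = torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

scores = torch.randn(size, size)          # toy attention scores
weights = F.softmax(scores + A, dim=-1)
print(weights)
# Row i has zero weight on every column j > i, i.e. each target token
# attends only to itself and earlier tokens.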

memory_key_padding_mask controls which encoder tokens the decoder tokens can attend to: 0 means a decoder token can attend to that encoder token, 1 means it cannot.
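A small sketch of this convention using torch.nn.MultiheadAttention, whose key_padding_mask plays the same role (True / 1 = mask out that encoder position); all shapes and values here are made up for illustration:

import torch
import torch.nn as nn

d_model, tgt_len, src_len = 8, 3, 5
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)

tgt = torch.randn(1, tgt_len, d_model)             # decoder-side queries
memory = torch.randn(1, src_len, d_model)          # encoder output

source_mask = torch.tensor([[1, 1, 1, 0, 0]])      # last two positions are padding
key_padding_mask = (1 - source_mask).bool()        # False = attend, True = mask

_, weights = attn(tgt, memory, memory, key_padding_mask=key_padding_mask)
print(weights)   # columns 3 and 4 get zero attention weight for every query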

Yes, we allow the decoder tokens to attend to the node tokens in the encoder output.
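Putting the two masks together, here is a self-contained sketch of a decoder call shaped like the one in the issue, with torch.nn.TransformerDecoder standing in for self.decoder (names, sizes, and the mask construction are assumptions for illustration, not the repository's exact code):

import torch
import torch.nn as nn

d_model, tgt_len = 16, 4
code_len, node_len = 6, 2                           # encoder side: code tokens + DFG node tokens
src_len = code_len + node_len

decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=4)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

tgt_embeddings = torch.randn(tgt_len, 1, d_model)   # (T, N, E)
encoder_output = torch.randn(src_len, 1, d_model)   # (S, N, E), node states included

# Causal mask over the target: 0 / -inf, as described above.
attn_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

# source_mask is 1 for code tokens and node tokens alike, so nothing on the
# encoder side -- nodes included -- is masked out of cross-attention.
source_mask = torch.ones(1, src_len, dtype=torch.long)

out = decoder(tgt_embeddings, encoder_output,
              tgt_mask=attn_mask,
              memory_key_padding_mask=(1 - source_mask).bool())
print(out.shape)                                    # torch.Size([4, 1, 16])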