Hi, thanks for sharing your code.

I am confused about the values of `attention_mask` in your code, which are either 0 or 1 when they are not learnable. Since the `attention_mask` is added to the `attention_scores` before the softmax operation, shouldn't its values be either 0 or a large negative value (e.g., -1e6)? Adding a large negative value at a position ensures that it is not observable (i.e., it will not be attended to) by the query token. Your implementation doesn't seem to guarantee that. Am I getting this wrong?
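For reference, here is a minimal sketch of the conversion I would have expected, assuming a PyTorch-style setup with a 0/1 padding mask of shape (batch, seq_len); the tensors and shapes are illustrative, not taken from your code:

```python
import torch
import torch.nn.functional as F

# Hypothetical 0/1 padding mask for a batch of 2 sequences of length 4,
# where 1 marks real tokens and 0 marks padding.
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]], dtype=torch.float)

# Convert to an additive mask: 0 for visible positions, a large negative
# value for masked ones, so softmax drives the masked probabilities to ~0.
additive_mask = (1.0 - attention_mask) * -1e6  # shape (batch, seq_len)

# Toy attention scores: adding the mask before softmax effectively
# removes the masked positions from the attention distribution.
attention_scores = torch.randn(2, 4)
attention_probs = F.softmax(attention_scores + additive_mask, dim=-1)
print(attention_probs)  # masked positions receive ~0 probability
```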