microsoft / SwinBERT

Research code for CVPR 2022 paper "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning"
https://arxiv.org/abs/2111.13196
MIT License

About Attention Mask #19

Open yangbang18 opened 2 years ago

yangbang18 commented 2 years ago

Hi, thanks for sharing your code.

I am confused about the values of attention_mask in your code, which are either 0 or 1 when they are not learnable.

As the attention_mask will be added to the attention_scores before the softmax operation, shouldn't the values of attention_mask be either 0 or a large negative value (e.g., -1e6)?

By adding large negative values to some positions, we can make sure that these positions are not observable (i.e., they will not be attended to) by the query token. Your implementation doesn't seem to guarantee that. Am I misunderstanding something?
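
For reference, here is a minimal sketch (not taken from this repo) of the standard transform the question refers to: a {0, 1} mask is converted to an additive mask of {-1e6, 0} before being added to the attention scores.

```python
import torch

def apply_additive_mask(attention_scores, binary_mask):
    # binary_mask: 1 = attend, 0 = block; must be broadcastable to attention_scores
    additive_mask = (1.0 - binary_mask) * -1e6          # blocked positions get -1e6
    return torch.softmax(attention_scores + additive_mask, dim=-1)

scores = torch.randn(2, 4, 4)                            # (batch, query, key)
mask = torch.tensor([1., 1., 1., 0.]).view(1, 1, 4)      # block the last key
print(apply_additive_mask(scores, mask)[0, 0])           # last column ~0 probability
```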

kevinlin311tw commented 2 years ago

In the code, we provide an option that applies thresholding to obtain a binary mask. This is only used at testing time.

In training, we use a soft mask, i.e., no thresholding is applied.
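
In case it helps later readers, a hedged sketch of that idea (the class and parameter names below are illustrative, not the repo's actual API): a learnable mask that stays soft during training and is only binarized with a threshold at test time.

```python
import torch
import torch.nn as nn

class LearnableSparseMask(nn.Module):
    """Illustrative learnable mask: soft in training, thresholded at test time."""
    def __init__(self, num_tokens, threshold=0.5):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_tokens, num_tokens))
        self.threshold = threshold

    def forward(self):
        soft_mask = torch.sigmoid(self.logits)            # values in (0, 1)
        if self.training:
            return soft_mask                               # soft mask, no thresholding
        return (soft_mask >= self.threshold).float()       # binary mask for inference
```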