Hi, thanks for sharing your code.

I am confused about the values of `attention_mask` in your code, which are either 0 or 1 when they are not learnable. Since the `attention_mask` is added to the `attention_scores` before the softmax operation, shouldn't its values be either 0 or a large negative value (e.g., -1e6)? Adding a large negative value at a position ensures that it is not observable (i.e., it will not be attended to) by the query token. Your implementation doesn't seem to guarantee that. Am I getting this wrong?
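For reference, here is a minimal sketch of the conversion I would have expected, assuming a PyTorch-style setup with a 0/1 padding mask of shape (batch, seq_len); the tensors and shapes are illustrative, not taken from your code:

```python
import torch
import torch.nn.functional as F

# Hypothetical 0/1 padding mask for a batch of 2 sequences of length 4,
# where 1 marks real tokens and 0 marks padding.
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]], dtype=torch.float)

# Convert to an additive mask: 0 for visible positions, a large negative
# value for masked ones, so softmax drives the masked probabilities to ~0.
additive_mask = (1.0 - attention_mask) * -1e6  # shape (batch, seq_len)

# Toy attention scores: adding the mask before softmax effectively
# removes the masked positions from the attention distribution.
attention_scores = torch.randn(2, 4)
attention_probs = F.softmax(attention_scores + additive_mask, dim=-1)
print(attention_probs)  # masked positions receive ~0 probability
```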