Open Microbiods opened 2 years ago
@Microbiods This mask works the same way as BERT's attention mask: TRUE for real tokens, FALSE for padding tokens.
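To make that convention concrete, here is a minimal sketch of building such a mask from a right-padded batch. The padding id `PAD_ID = 0` and the token ids are assumptions for illustration only; use whatever your tokenizer defines.

```python
import torch

PAD_ID = 0  # hypothetical padding token id, not taken from the repo

# Two sequences of different lengths, right-padded to length 6.
x = torch.tensor([
    [5, 17, 42, 8, PAD_ID, PAD_ID],
    [9, 3, PAD_ID, PAD_ID, PAD_ID, PAD_ID],
])

# True for real tokens, False for padding (the BERT-style convention).
mask = x != PAD_ID

print(mask)
```

The resulting boolean tensor has the same shape as `x` and can be passed as the `mask` argument in place of `torch.ones_like(x).bool()`, which marks every position as a real token.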
Another question on masking: in a standard transformer the mask is applied after QK^T and before the softmax, but since here we compute the KV product first, when is the mask applied? After the full attention is calculated?
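One possible way to handle a padding mask in linear attention, offered only as a sketch and not as a description of what this repo does, is to zero out the feature-mapped keys at padded positions before forming the KV product. Those positions then drop out of both the numerator and the normalizer, so no mask is needed afterwards. The `elu + 1` feature map below is a simple stand-in (Performer uses random features), and all function names are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def feature_map(t):
    # Stand-in positive feature map (assumption); Performer uses random features instead.
    return F.elu(t) + 1

def linear_attention(q, k, v, mask):
    # q, k, v: (batch, seq, dim); mask: (batch, seq), True = real token.
    q_f, k_f = feature_map(q), feature_map(k)
    # Zero padded keys so they contribute to neither the KV sum nor the normalizer.
    k_f = k_f * mask[:, :, None].to(k_f.dtype)
    kv = torch.einsum('bnd,bne->bde', k_f, v)    # KV product, computed first
    k_sum = k_f.sum(dim=1)                       # normalizer term
    num = torch.einsum('bnd,bde->bne', q_f, kv)
    den = torch.einsum('bnd,bd->bn', q_f, k_sum)
    return num / den[:, :, None]

def quadratic_reference(q, k, v, mask):
    # Same kernel, but with the full (seq x seq) weight matrix masked explicitly.
    q_f, k_f = feature_map(q), feature_map(k)
    weights = torch.einsum('bid,bjd->bij', q_f, k_f)
    weights = weights.masked_fill(~mask[:, None, :], 0.0)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return torch.einsum('bij,bjd->bid', weights, v)

q = torch.randn(2, 6, 4, dtype=torch.float64)
k = torch.randn(2, 6, 4, dtype=torch.float64)
v = torch.randn(2, 6, 4, dtype=torch.float64)
mask = torch.tensor([
    [True, True, True, True, False, False],
    [True, True, False, False, False, False],
])

out_linear = linear_attention(q, k, v, mask)
out_reference = quadratic_reference(q, k, v, mask)
print(torch.allclose(out_linear, out_reference))
```

The comparison against the explicitly masked quadratic version is there to check that masking the keys up front gives the same result as masking the attention weights, for this non-causal, padding-only case.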
Hi, thanks for the wonderful repo. I am new to BERT, so I'd like to make sure I understand your example:
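    import torch
    from performer_pytorch import PerformerLM

    model = PerformerLM()  # constructor arguments omitted here
    x = torch.randint(0, 20000, (1, 2048))   # random token ids
    mask = torch.ones_like(x).bool()         # every position marked as a real token
    model(x, mask = mask) # (1, 2048, 20000)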
Is this 'mask' the attention_mask, i.e., TRUE (1) for normal tokens and FALSE (0) for padding tokens? Or should 1 indicate a padding token? Thanks a lot!