lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in Pytorch
MIT License

Question about masking #89

Open Microbiods opened 2 years ago

Microbiods commented 2 years ago

Hi, thanks for the wonderful repo. I am new to BERT, so I'd like to make sure about your example:

```python
import torch
from performer_pytorch import PerformerLM

model = PerformerLM(num_tokens = 20000, max_seq_len = 2048, dim = 512, depth = 12, heads = 8)
x = torch.randint(0, 20000, (1, 2048))
mask = torch.ones_like(x).bool()
model(x, mask = mask) # (1, 2048, 20000)
```

Is this 'mask' an attention mask, i.e., TRUE (1) for normal tokens and FALSE (0) for padding tokens? Or is 1 set to indicate a padding token? Thanks a lot!

BomanNg commented 1 year ago

@Microbiods this attention mask is the same as BERT's attention mask: TRUE for normal tokens, FALSE for padding tokens.
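
A minimal sketch of how you might build such a mask from padded input ids (the pad id and the PerformerLM hyperparameters here are just illustrative, not from the repo):

```python
import torch
from performer_pytorch import PerformerLM

PAD_ID = 0  # assumed padding token id for this sketch

model = PerformerLM(num_tokens = 20000, max_seq_len = 2048, dim = 512, depth = 6, heads = 8)

# two sequences, the second one padded out to length 2048 with PAD_ID
x = torch.randint(1, 20000, (2, 2048))
x[1, 1024:] = PAD_ID

mask = x != PAD_ID          # True for real tokens, False for padding
out = model(x, mask = mask) # (2, 2048, 20000)
```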

jdefriel commented 1 year ago

Another question on masking: in a normal transformer the mask is applied after QK^T and before the softmax, but since we compute KV first here, when is the mask applied? After the full attention is calculated?
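
From what I can tell, one option is to zero out the padded key/value positions before the K^T V aggregation, so they contribute nothing to either the context or the normalizer (the repo appears to do something like this by mask-filling k and v). A rough, non-causal sketch of that idea, not the actual implementation:

```python
import torch

def naive_linear_attention(q, k, v, mask = None):
    # q, k, v: (batch, heads, seq, dim_head), already passed through the kernel feature map
    # mask: (batch, seq) bool, True for real tokens, False for padding
    if mask is not None:
        m = mask[:, None, :, None]   # broadcast over heads and feature dim
        k = k.masked_fill(~m, 0.)    # padded keys add nothing to the K^T V sum
        v = v.masked_fill(~m, 0.)    # padded values add nothing to the context

    context = torch.einsum('bhnd,bhne->bhde', k, v)                      # sum over sequence first
    normalizer = 1. / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim = -2)) + 1e-8)
    out = torch.einsum('bhnd,bhde,bhn->bhne', q, context, normalizer)
    return out
```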