In the `forward` pass implementation we have:

```python
input_ids, attention_mask = input_ids.to(self.device), attention_mask.to(self.device)
D = self.bert(input_ids, attention_mask=attention_mask)[0]
D = self.linear(D)
mask = torch.tensor(self.mask(input_ids, skiplist=self.skiplist), device=self.device).unsqueeze(2).float()
D = D * mask
```
This means that the token embeddings for `skiplist` tokens and `pad_token_id` are zeroed in the output; however, those tokens are still considered inside the `bert` forward pass, since `attention_mask` does not exclude them. Is this expected? Should `attention_mask` also mask out those tokens, in the same way the output mask built from `input_ids` does?
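For concreteness, here is a minimal sketch of what the second option could look like. This is an assumption about intent, not the actual repo code; it only presumes that `self.mask` returns a per-token 0/1 (or boolean) matrix, as its current usage suggests:

```python
# Sketch (assumption, not the actual implementation): build the skiplist/pad
# mask once, apply it to attention_mask *before* calling BERT, and reuse it
# to zero the output embeddings as the current code already does.
skip_mask = torch.tensor(
    self.mask(input_ids, skiplist=self.skiplist), device=self.device
)  # shape (batch, seq_len); 1 keeps a token, 0 marks skiplist/pad

attention_mask = attention_mask * skip_mask  # hide those tokens from self-attention too

D = self.bert(input_ids, attention_mask=attention_mask)[0]
D = self.linear(D)
D = D * skip_mask.unsqueeze(2).float()  # same output zeroing as before
```

Note that this would not only zero the skiplist embeddings but also change the contextualized embeddings of every other token, since they could no longer attend to the masked positions, so the two behaviors are not equivalent.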